40 CPU and 156GB RAM per nodemanager.
We are trying to do some validation on Impala. We are using a 5 node hadoop
cluster m4.10x large machines on the aws. These were setup with cloudera
manager. Each node: 40 cpus' and 156GB ram.

We are trying to do a count(*) on two tables with 14 million rows each.
When do a count(*) individually on each of these nodes, then it works fine.
However we try to do a join on these two tables and do a count(*) we hit
the problem below saying memory limit exceeded.

Any inputs here are appreciated ? Should we need a bigger cluster for this
? How do we decide the size of the cluster.


============

[ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new;

Query: select count(*) from tab3_test_new

+----------+

| count(*) |

+----------+

| 14216336 |

+----------+

Fetched 1 row(s) in 8.29s

[ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new;

Query: select count(*) from tab4_test_new

+----------+

| count(*) |

+----------+

| 14987634 |

+----------+

Fetched 1 row(s) in 11.46s

[ip-172-30-1-57.ec2.internal:21000] >

ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g;

MEM_LIMIT set to 64g

[ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new
join tab4_test_new;

Query: select count(*) from tab3_test_new join tab4_test_new

WARNINGS:

Memory Limit Exceeded

Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB
Consumption=80.77 MB

  Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB

    AGGREGATION_NODE (id=6): Consumption=4.00 KB

    EXCHANGE_NODE (id=5): Consumption=0

    DataStreamRecvr: Consumption=0

  Block Manager: Limit=156.00 MB Consumption=0

  Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB

    AGGREGATION_NODE (id=3): Consumption=4.00 KB

    NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB

    HDFS_SCAN_NODE (id=0): Consumption=0

    EXCHANGE_NODE (id=4): Consumption=0

    DataStreamRecvr: Consumption=544.00 KB

    DataStreamSender: Consumption=16.00 KB

  Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB

    HDFS_SCAN_NODE (id=1): Consumption=16.01 MB

    DataStreamSender: Consumption=128.00 KB

WARNING: The following tables are missing relevant table and/or column
statistics.

test_db.tab3_test_new,test_db.tab4_test_new



[ip-172-30-1-57.ec2.internal:21000] >

Reply via email to