You need to specify a join condition. I.e. table_a join table_b on table_a.key = table_2.key
table_a join table_b computes the cross-product of the two tables. You can confirm by running explain: explain select count(*) from tab3_test_new join tab4_test_new It will show a plan like this with a "CROSS JOIN" in it. [tarmstrong-box.ca.cloudera.com:21000] > explain select * from functional.alltypes join functional.alltypes t2; Query: explain select * from functional.alltypes join functional.alltypes t2 +-----------------------------------------------------------+ | Explain String | +-----------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=320.68MB VCores=2 | | | | 04:EXCHANGE [UNPARTITIONED] | | | | | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | | | | | |--03:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [functional.alltypes t2] | | | partitions=24/24 files=24 size=478.45KB | | | | | 00:SCAN HDFS [functional.alltypes] | | partitions=24/24 files=24 size=478.45KB | +-----------------------------------------------------------+ On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak <[email protected]> wrote: > 40 CPU and 156GB RAM per nodemanager. > We are trying to do some validation on Impala. We are using a 5 node hadoop > cluster m4.10x large machines on the aws. These were setup with cloudera > manager. Each node: 40 cpus' and 156GB ram. > > We are trying to do a count(*) on two tables with 14 million rows each. > When do a count(*) individually on each of these nodes, then it works fine. > However we try to do a join on these two tables and do a count(*) we hit > the problem below saying memory limit exceeded. > > Any inputs here are appreciated ? Should we need a bigger cluster for this > ? How do we decide the size of the cluster. > > > ============ > > [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new; > > Query: select count(*) from tab3_test_new > > +----------+ > > | count(*) | > > +----------+ > > | 14216336 | > > +----------+ > > Fetched 1 row(s) in 8.29s > > [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new; > > Query: select count(*) from tab4_test_new > > +----------+ > > | count(*) | > > +----------+ > > | 14987634 | > > +----------+ > > Fetched 1 row(s) in 11.46s > > [ip-172-30-1-57.ec2.internal:21000] > > > ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g; > > MEM_LIMIT set to 64g > > [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new > join tab4_test_new; > > Query: select count(*) from tab3_test_new join tab4_test_new > > WARNINGS: > > Memory Limit Exceeded > > Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB > Consumption=80.77 MB > > Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB > > AGGREGATION_NODE (id=6): Consumption=4.00 KB > > EXCHANGE_NODE (id=5): Consumption=0 > > DataStreamRecvr: Consumption=0 > > Block Manager: Limit=156.00 MB Consumption=0 > > Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB > > AGGREGATION_NODE (id=3): Consumption=4.00 KB > > NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB > > HDFS_SCAN_NODE (id=0): Consumption=0 > > EXCHANGE_NODE (id=4): Consumption=0 > > DataStreamRecvr: Consumption=544.00 KB > > DataStreamSender: Consumption=16.00 KB > > Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB > > HDFS_SCAN_NODE (id=1): Consumption=16.01 MB > > DataStreamSender: Consumption=128.00 KB > > WARNING: The following tables are missing relevant table and/or column > statistics. > > test_db.tab3_test_new,test_db.tab4_test_new > > > > [ip-172-30-1-57.ec2.internal:21000] > >
