Besides adding a join condition, could you also check the scratch directory permissions? The memory consumption is much lower than the mem limit when it throws the OOM, so I suspect the spill is failing.

On Apr 27, 2016 10:05 PM, "Pradeep Nayak" <[email protected]> wrote:
> I will see if it works with specifying the join condition.
>
> On Wed, Apr 27, 2016 at 2:28 PM Tim Armstrong <[email protected]> wrote:
>
>> You need to specify a join condition, i.e. table_a join table_b on
>> table_a.key = table_b.key
>>
>> table_a join table_b computes the cross-product of the two tables.
>>
>> You can confirm by running explain:
>>
>> explain select count(*) from tab3_test_new join tab4_test_new
>>
>> It will show a plan like this with a "CROSS JOIN" in it:
>>
>> [tarmstrong-box.ca.cloudera.com:21000] > explain select * from functional.alltypes join functional.alltypes t2;
>> Query: explain select * from functional.alltypes join functional.alltypes t2
>> +-----------------------------------------------------------+
>> | Explain String                                            |
>> +-----------------------------------------------------------+
>> | Estimated Per-Host Requirements: Memory=320.68MB VCores=2 |
>> |                                                           |
>> | 04:EXCHANGE [UNPARTITIONED]                               |
>> | |                                                         |
>> | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
>> | |                                                         |
>> | |--03:EXCHANGE [BROADCAST]                                |
>> | |  |                                                      |
>> | |  01:SCAN HDFS [functional.alltypes t2]                  |
>> | |     partitions=24/24 files=24 size=478.45KB             |
>> | |                                                         |
>> | 00:SCAN HDFS [functional.alltypes]                        |
>> |    partitions=24/24 files=24 size=478.45KB                |
>> +-----------------------------------------------------------+
>>
>> On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak <[email protected]> wrote:
>>
>>> 40 CPU and 156GB RAM per nodemanager.
>>>
>>> We are trying to do some validation on Impala. We are using a 5-node
>>> Hadoop cluster of m4.10xlarge machines on AWS, set up with Cloudera
>>> Manager. Each node has 40 CPUs and 156GB of RAM.
>>>
>>> We are trying to do a count(*) on a join of two tables with ~14 million
>>> rows each. A count(*) on each table individually works fine. However,
>>> when we join the two tables and do a count(*), we hit the "Memory Limit
>>> Exceeded" error below.
>>>
>>> Any inputs here are appreciated. Do we need a bigger cluster for this?
>>> How do we decide the size of the cluster?
>>>
>>> ============
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new;
>>> Query: select count(*) from tab3_test_new
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> | 14216336 |
>>> +----------+
>>> Fetched 1 row(s) in 8.29s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new;
>>> Query: select count(*) from tab4_test_new
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> | 14987634 |
>>> +----------+
>>> Fetched 1 row(s) in 11.46s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g;
>>> MEM_LIMIT set to 64g
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new join tab4_test_new;
>>> Query: select count(*) from tab3_test_new join tab4_test_new
>>> WARNINGS:
>>> Memory Limit Exceeded
>>> Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB Consumption=80.77 MB
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB
>>>     AGGREGATION_NODE (id=6): Consumption=4.00 KB
>>>     EXCHANGE_NODE (id=5): Consumption=0
>>>     DataStreamRecvr: Consumption=0
>>>   Block Manager: Limit=156.00 MB Consumption=0
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB
>>>     AGGREGATION_NODE (id=3): Consumption=4.00 KB
>>>     NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB
>>>     HDFS_SCAN_NODE (id=0): Consumption=0
>>>     EXCHANGE_NODE (id=4): Consumption=0
>>>     DataStreamRecvr: Consumption=544.00 KB
>>>     DataStreamSender: Consumption=16.00 KB
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB
>>>     HDFS_SCAN_NODE (id=1): Consumption=16.01 MB
>>>     DataStreamSender: Consumption=128.00 KB
>>>
>>> WARNING: The following tables are missing relevant table and/or column
>>> statistics.
>>> test_db.tab3_test_new,test_db.tab4_test_new
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] >
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Impala Dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
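For scale, the size of the cross-product that the unqualified join produces can be worked out directly from the two count(*) results in the transcript above; this is just arithmetic on the reported counts, not anything Impala-specific:

```python
# Row counts taken from the count(*) output in the thread above.
rows_tab3 = 14_216_336
rows_tab4 = 14_987_634

# A JOIN with no ON clause is a cross join, so the intermediate
# result has rows_tab3 * rows_tab4 rows.
cross_rows = rows_tab3 * rows_tab4
print(f"{cross_rows:,} rows (~{cross_rows:.2e})")
# → 213,069,240,789,024 rows (~2.13e+14)
```

At roughly 2 × 10^14 intermediate rows, even a tiny per-row cost makes the query enormous, which matches the NESTED LOOP JOIN [CROSS JOIN] the explain output shows. Separately, the missing-statistics warning at the end of the transcript can be cleared with `COMPUTE STATS test_db.tab3_test_new;` (and likewise for `tab4_test_new`), which gives the planner the row counts it needs to size joins.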
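To follow up on the scratch-directory suggestion at the top of the thread: Impala spills to the directories configured by the impalad --scratch_dirs flag (the default location is commonly /tmp/impala-scratch, but confirm against your own startup flags). A minimal sketch of a writability check, where the path, and running it as the impala service user, are both assumptions to adapt:

```shell
# Minimal sketch: verify an Impala scratch directory is present and
# writable. The default path /tmp/impala-scratch is an assumption;
# check the --scratch_dirs flag on your impalad processes.
check_scratch() {
  dir="${1:-/tmp/impala-scratch}"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir is writable"
  else
    echo "FAIL: $dir is missing or not writable"
    return 1
  fi
}
```

Run it as the user the impalad processes run under (typically the `impala` service account), since that is the user whose permissions matter for spilling.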
