Besides adding the join condition, could you also check the scratch
directory permissions? The memory consumption is much lower than the mem
limit at the point it throws OOM, so I suspect the spill to disk is failing.
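A quick way to check is a sketch like the one below. It assumes the default
scratch location /tmp/impala-scratch; substitute the directories from your
impalad --scratch_dirs setting if you have changed it, and run it as the user
impalad runs as.

```shell
# check_scratch: report whether a directory exists and is writable
# by the current user.
check_scratch() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "OK: $1"
  else
    echo "NOT WRITABLE OR MISSING: $1"
  fi
}

# Default Impala scratch location -- substitute your --scratch_dirs paths.
check_scratch /tmp/impala-scratch
```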
On Apr 27, 2016 10:05 PM, "Pradeep Nayak" <[email protected]> wrote:

> I will see if it works with specifying the join condition.
>
> On Wed, Apr 27, 2016 at 2:28 PM Tim Armstrong <[email protected]>
> wrote:
>
>> You need to specify a join condition, i.e. table_a join table_b on
>> table_a.key = table_b.key.
>>
>>   table_a join table_b computes the cross-product of the two tables.
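>>
>> For example (assuming both tables share a key column; the column name
>> below is a guess, so substitute your actual join columns):
>>
>> ```sql
>> -- "id" is a hypothetical join key -- replace with the real key columns.
>> select count(*)
>> from tab3_test_new a
>> join tab4_test_new b on a.id = b.id;
>> ```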
>>
>> You can confirm by running explain:
>>
>>   explain select count(*) from tab3_test_new join tab4_test_new
>>
>>
>> It will show a plan with a "CROSS JOIN" in it, like this:
>>
>> [tarmstrong-box.ca.cloudera.com:21000] > explain select * from
>> functional.alltypes join functional.alltypes t2;
>> Query: explain select * from functional.alltypes join functional.alltypes
>> t2
>> +-----------------------------------------------------------+
>> | Explain String                                            |
>> +-----------------------------------------------------------+
>> | Estimated Per-Host Requirements: Memory=320.68MB VCores=2 |
>> |                                                           |
>> | 04:EXCHANGE [UNPARTITIONED]                               |
>> | |                                                         |
>> | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
>> | |                                                         |
>> | |--03:EXCHANGE [BROADCAST]                                |
>> | |  |                                                      |
>> | |  01:SCAN HDFS [functional.alltypes t2]                  |
>> | |     partitions=24/24 files=24 size=478.45KB             |
>> | |                                                         |
>> | 00:SCAN HDFS [functional.alltypes]                        |
>> |    partitions=24/24 files=24 size=478.45KB                |
>> +-----------------------------------------------------------+
>>
>>
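>> Separately, the "missing statistics" warning in your output below means
>> the planner has no row-count estimates for these tables. Running COMPUTE
>> STATS on both tables should help it pick better join strategies (table
>> names taken from your session):
>>
>> ```sql
>> -- Collect table and column statistics for the planner.
>> compute stats tab3_test_new;
>> compute stats tab4_test_new;
>> ```
>>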
>> On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak <[email protected]>
>> wrote:
>>
>>> 40 CPUs and 156GB RAM per NodeManager.
>>> We are trying to do some validation on Impala. We are using a 5-node
>>> Hadoop cluster of m4.10xlarge machines on AWS, set up with Cloudera
>>> Manager. Each node has 40 CPUs and 156GB of RAM.
>>>
>>> We are trying to do a count(*) on two tables with 14 million rows each.
>>> When we do a count(*) individually on each of these tables, it works
>>> fine. However, when we try to join the two tables and do a count(*), we
>>> hit the problem below: memory limit exceeded.
>>>
>>> Any input here is appreciated. Do we need a bigger cluster for this?
>>> How do we decide on the size of the cluster?
>>>
>>>
>>> ============
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new;
>>> Query: select count(*) from tab3_test_new
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> | 14216336 |
>>> +----------+
>>> Fetched 1 row(s) in 8.29s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new;
>>> Query: select count(*) from tab4_test_new
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> | 14987634 |
>>> +----------+
>>> Fetched 1 row(s) in 11.46s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g;
>>> MEM_LIMIT set to 64g
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new
>>> join tab4_test_new;
>>> Query: select count(*) from tab3_test_new join tab4_test_new
>>> WARNINGS:
>>> Memory Limit Exceeded
>>> Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB
>>> Consumption=80.77 MB
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB
>>>     AGGREGATION_NODE (id=6): Consumption=4.00 KB
>>>     EXCHANGE_NODE (id=5): Consumption=0
>>>     DataStreamRecvr: Consumption=0
>>>   Block Manager: Limit=156.00 MB Consumption=0
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB
>>>     AGGREGATION_NODE (id=3): Consumption=4.00 KB
>>>     NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB
>>>     HDFS_SCAN_NODE (id=0): Consumption=0
>>>     EXCHANGE_NODE (id=4): Consumption=0
>>>     DataStreamRecvr: Consumption=544.00 KB
>>>     DataStreamSender: Consumption=16.00 KB
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB
>>>     HDFS_SCAN_NODE (id=1): Consumption=16.01 MB
>>>     DataStreamSender: Consumption=128.00 KB
>>> WARNING: The following tables are missing relevant table and/or column
>>> statistics.
>>> test_db.tab3_test_new,test_db.tab4_test_new
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] >
>>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Impala Dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>>
>
