You need to specify a join condition. I.e. table_a join table_b on
table_a.key = table_2.key

  table_a join table_b computes the cross-product of the two tables.

You can confirm by running explain:

  explain select count(*) from tab3_test_new join tab4_test_new


It will show a plan like this with a "CROSS JOIN" in it.

[tarmstrong-box.ca.cloudera.com:21000] > explain select * from
functional.alltypes join functional.alltypes t2;
Query: explain select * from functional.alltypes join functional.alltypes t2
+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=320.68MB VCores=2 |
|                                                           |
| 04:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
| |                                                         |
| |--03:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  01:SCAN HDFS [functional.alltypes t2]                  |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 00:SCAN HDFS [functional.alltypes]                        |
|    partitions=24/24 files=24 size=478.45KB                |
+-----------------------------------------------------------+


On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak <[email protected]>
wrote:

> 40 CPU and 156GB RAM per nodemanager.
> We are trying to do some validation on Impala. We are using a 5 node hadoop
> cluster m4.10x large machines on the aws. These were setup with cloudera
> manager. Each node: 40 cpus' and 156GB ram.
>
> We are trying to do a count(*) on two tables with 14 million rows each.
> When do a count(*) individually on each of these nodes, then it works fine.
> However we try to do a join on these two tables and do a count(*) we hit
> the problem below saying memory limit exceeded.
>
> Any inputs here are appreciated ? Should we need a bigger cluster for this
> ? How do we decide the size of the cluster.
>
>
> ============
>
> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new;
>
> Query: select count(*) from tab3_test_new
>
> +----------+
>
> | count(*) |
>
> +----------+
>
> | 14216336 |
>
> +----------+
>
> Fetched 1 row(s) in 8.29s
>
> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new;
>
> Query: select count(*) from tab4_test_new
>
> +----------+
>
> | count(*) |
>
> +----------+
>
> | 14987634 |
>
> +----------+
>
> Fetched 1 row(s) in 11.46s
>
> [ip-172-30-1-57.ec2.internal:21000] >
>
> ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g;
>
> MEM_LIMIT set to 64g
>
> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new
> join tab4_test_new;
>
> Query: select count(*) from tab3_test_new join tab4_test_new
>
> WARNINGS:
>
> Memory Limit Exceeded
>
> Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB
> Consumption=80.77 MB
>
>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB
>
>     AGGREGATION_NODE (id=6): Consumption=4.00 KB
>
>     EXCHANGE_NODE (id=5): Consumption=0
>
>     DataStreamRecvr: Consumption=0
>
>   Block Manager: Limit=156.00 MB Consumption=0
>
>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB
>
>     AGGREGATION_NODE (id=3): Consumption=4.00 KB
>
>     NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB
>
>     HDFS_SCAN_NODE (id=0): Consumption=0
>
>     EXCHANGE_NODE (id=4): Consumption=0
>
>     DataStreamRecvr: Consumption=544.00 KB
>
>     DataStreamSender: Consumption=16.00 KB
>
>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB
>
>     HDFS_SCAN_NODE (id=1): Consumption=16.01 MB
>
>     DataStreamSender: Consumption=128.00 KB
>
> WARNING: The following tables are missing relevant table and/or column
> statistics.
>
> test_db.tab3_test_new,test_db.tab4_test_new
>
>
>
> [ip-172-30-1-57.ec2.internal:21000] >
>

Reply via email to