[ 
https://issues.apache.org/jira/browse/IMPALA-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773560#comment-16773560
 ] 

ASF subversion and git services commented on IMPALA-8214:
---------------------------------------------------------

Commit c659b78198a767b91c293cbaf77f5c8b269fba39 in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c659b78 ]

IMPALA-8214: Fix bad plan in load_nested.py

The previous plan put the larger input on the build side of the join and
used a broadcast join, which copies that input to every node and is far
from optimal.

This speeds up data loading on my minicluster (18s vs. 31s) and has a
more significant impact on a real cluster, where queries execute
much faster, the memory requirement is significantly reduced, and
the data loading can potentially be broken up into fewer chunks.

I also considered computing stats on the table to let Impala generate
the same plan, but this achieves the same goal more efficiently.
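
As a rough illustration of "flip the join and make it a shuffle join", the
sketch below rebuilds the tmp_customer_sql query from the issue with customer
on the right-hand (build) side and a shuffle hint. It is a minimal sketch, not
a copy of the committed change: the helper name build_tmp_customer_sql, the use
of STRAIGHT_JOIN to pin the written join order, the comment-style hint
placement, and the example database name tpch_parquet are all assumptions made
for illustration.

```python
# Sketch only: one possible way to flip the join and request a shuffle join.
# STRAIGHT_JOIN and the SHUFFLE hint are Impala features; how they are combined
# here is an assumption, not the exact text of the commit.
def build_tmp_customer_sql(source_db, chunks, chunk_idx):
    """Return the customer-nesting query with customer on the build side.

    `A LEFT JOIN B` is rewritten as `B RIGHT JOIN A`, which returns the same
    rows but makes the smaller customer table the right-hand (build) input.
    """
    template = r"""
      SELECT STRAIGHT_JOIN
        c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal,
        c_mktsegment, c_comment,
        GROUP_CONCAT(
          CONCAT(
            CAST(o_orderkey AS STRING), '\003',
            CAST(o_orderstatus AS STRING), '\003',
            CAST(o_totalprice AS STRING), '\003',
            CAST(o_orderdate AS STRING), '\003',
            CAST(o_orderpriority AS STRING), '\003',
            CAST(o_clerk AS STRING), '\003',
            CAST(o_shippriority AS STRING), '\003',
            CAST(o_comment AS STRING), '\003',
            CAST(lineitems_string AS STRING)
          ), '\002'
        ) orders_string
      FROM tmp_orders_string
      RIGHT JOIN /* +SHUFFLE */ {source_db}.customer ON c_custkey = o_custkey
      WHERE c_custkey % {chunks} = {chunk_idx}
      GROUP BY 1, 2, 3, 4, 5, 6, 7, 8"""
    return template.format(
        source_db=source_db, chunks=chunks, chunk_idx=chunk_idx)


if __name__ == "__main__":
    # "tpch_parquet" is a hypothetical database name used only for this demo.
    print(build_tmp_customer_sql("tpch_parquet", chunks=4, chunk_idx=1))
```

Hinting the query directly avoids a separate stats-computation pass over the
temporary table, which is the trade-off the paragraph above alludes to.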

Testing:
Ran core tests. Resource estimates in planner tests changed slightly
because of the different distribution of data.

Change-Id: I55e0ca09590a90ba530efe4e8f8bf587dde3eeeb
Reviewed-on: http://gerrit.cloudera.org:8080/12519
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Bad plan in load_nested.py
> --------------------------
>
>                 Key: IMPALA-8214
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8214
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>
> The plan for the SQL below, which is executed without stats, puts the larger 
> input on the build side of the join and uses a broadcast join, which is far 
> from optimal. This causes high memory consumption when loading larger scale 
> factors, and generally makes the loading process slower than necessary. We 
> should flip the join and make it a shuffle join.
> https://github.com/apache/impala/blob/d481cd4/testdata/bin/load_nested.py#L123
> {code}
>       tmp_customer_sql = r"""
>           SELECT
>             c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment,
>             c_comment,
>             GROUP_CONCAT(
>               CONCAT(
>                 CAST(o_orderkey AS STRING), '\003',
>                 CAST(o_orderstatus AS STRING), '\003',
>                 CAST(o_totalprice AS STRING), '\003',
>                 CAST(o_orderdate AS STRING), '\003',
>                 CAST(o_orderpriority AS STRING), '\003',
>                 CAST(o_clerk AS STRING), '\003',
>                 CAST(o_shippriority AS STRING), '\003',
>                 CAST(o_comment AS STRING), '\003',
>                 CAST(lineitems_string AS STRING)
>               ), '\002'
>             ) orders_string
>           FROM {source_db}.customer
>           LEFT JOIN tmp_orders_string ON c_custkey = o_custkey
>           WHERE c_custkey % {chunks} = {chunk_idx}
>           GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)
> {code}


