Work on IMPALA-8214 started by Tim Armstrong.
---------------------------------------------
> Bad plan in load_nested.py
> --------------------------
>
> Key: IMPALA-8214
> URL: https://issues.apache.org/jira/browse/IMPALA-8214
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 3.1.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
>
> The plan for the SQL below, which is executed without stats, puts the larger
> input on the build side of the join and uses a broadcast join. Broadcasting
> the larger input causes high memory consumption when loading larger scale
> factors and makes the loading process slower than necessary. We should flip
> the join, so the smaller input is on the build side, and make it a shuffle
> join.
> https://github.com/apache/impala/blob/d481cd4/testdata/bin/load_nested.py#L123
> {code}
> tmp_customer_sql = r"""
>     SELECT
>       c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal,
>       c_mktsegment,
>       c_comment,
>       GROUP_CONCAT(
>         CONCAT(
>           CAST(o_orderkey AS STRING), '\003',
>           CAST(o_orderstatus AS STRING), '\003',
>           CAST(o_totalprice AS STRING), '\003',
>           CAST(o_orderdate AS STRING), '\003',
>           CAST(o_orderpriority AS STRING), '\003',
>           CAST(o_clerk AS STRING), '\003',
>           CAST(o_shippriority AS STRING), '\003',
>           CAST(o_comment AS STRING), '\003',
>           CAST(lineitems_string AS STRING)
>         ), '\002'
>       ) orders_string
>     FROM {source_db}.customer
>     LEFT JOIN tmp_orders_string ON c_custkey = o_custkey
>     WHERE c_custkey % {chunks} = {chunk_idx}
>     GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)
> {code}
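One possible shape for the proposed change, as a sketch only and not necessarily the fix that was committed: rewrite the LEFT JOIN as the equivalent RIGHT JOIN so that customer becomes the right-hand (build) input, pin that order with STRAIGHT_JOIN, and request a partitioned join with a SHUFFLE hint. This assumes tmp_orders_string is the larger input, as the description implies. The sql_params values are hypothetical placeholders, not taken from load_nested.py.

```python
# Hypothetical parameter values, for illustration only.
sql_params = {"source_db": "tpch_parquet", "chunks": 4, "chunk_idx": 0}

# Sketch: same query with the join flipped and hinted.
# - STRAIGHT_JOIN keeps the join order as written, so the planner does not
#   reorder the inputs in the absence of stats.
# - "a LEFT JOIN b" is rewritten as "b RIGHT JOIN a", which returns the same
#   rows but makes customer the right-hand (build) side.
# - The SHUFFLE hint asks for a partitioned join instead of a broadcast join.
tmp_customer_sql = r"""
    SELECT STRAIGHT_JOIN
      c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal,
      c_mktsegment,
      c_comment,
      GROUP_CONCAT(
        CONCAT(
          CAST(o_orderkey AS STRING), '\003',
          CAST(o_orderstatus AS STRING), '\003',
          CAST(o_totalprice AS STRING), '\003',
          CAST(o_orderdate AS STRING), '\003',
          CAST(o_orderpriority AS STRING), '\003',
          CAST(o_clerk AS STRING), '\003',
          CAST(o_shippriority AS STRING), '\003',
          CAST(o_comment AS STRING), '\003',
          CAST(lineitems_string AS STRING)
        ), '\002'
      ) orders_string
    FROM tmp_orders_string
    RIGHT JOIN /* +SHUFFLE */ {source_db}.customer ON c_custkey = o_custkey
    WHERE c_custkey % {chunks} = {chunk_idx}
    GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)

print(tmp_customer_sql)
```

Running EXPLAIN on the generated statement before and after the change is a reasonable way to check that the build side and the distribution mode actually changed.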