[
https://issues.apache.org/jira/browse/DRILL-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aman Sinha updated DRILL-3352:
------------------------------
Fix Version/s: (was: 1.2.0)
1.3.0
> Extra re-distribution when evaluating window function after GROUP BY
> --------------------------------------------------------------------
>
> Key: DRILL-3352
> URL: https://issues.apache.org/jira/browse/DRILL-3352
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.0.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Labels: window_function
> Fix For: 1.3.0
>
>
> Consider the following query and plan:
> {code}
> explain plan for select min(l_partkey) over (partition by l_suppkey) from
> lineitem group by l_partkey, l_suppkey limit 1;
> 00-00 Screen
> 00-01 Project(EXPR$0=[$0])
> 00-02 SelectionVectorRemover
> 00-03 Limit(fetch=[1])
> 00-04 UnionExchange
> 01-01 Project($0=[$3])
> 01-02 Window(window#0=[window(partition {1} order by [] range
> between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [MIN($0)])])
> 01-03 SelectionVectorRemover
> 01-04 Sort(sort0=[$1], dir0=[ASC])
> 01-05 Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
> 01-06 HashToRandomExchange(dist0=[[$1]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(l_partkey=[$0], l_suppkey=[$1],
> $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
> 03-02 HashAgg(group=[{0, 1}], agg#0=[MIN($2)])
> 03-03 Project(l_partkey=[$0], l_suppkey=[$1],
> $f2=[$2])
> 03-04 HashToRandomExchange(dist0=[[$0]],
> dist1=[[$1]])
> 04-01 UnorderedMuxExchange
> 05-01 Project(l_partkey=[$0],
> l_suppkey=[$1], $f2=[$2],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, hash64AsDouble($0)))])
> 05-02 HashAgg(group=[{0, 1}],
> agg#0=[MIN($0)])
> 05-03 Project(l_partkey=[$1],
> l_suppkey=[$0])
> 05-04
> Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=file:/Users/asinha/data/tpch-sf1/lineitem]],
> selectionRoot=/Users/asinha/data/tpch-sf1/lineitem, numFiles=1,
> columns=[`l_partkey`, `l_suppkey`]]])
> {code}
> Here, we do a distribution for the HashAgg on 2 columns: {l_partkey,
> l_suppkey}. Subsequently, we re-distribute on {l_suppkey} only since the
> window function has a partition-by l_suppkey. The second re-distribute could
> be avoided if the first distribution for the HashAgg was done on l_suppkey
> only. The reason we do distribution on all grouping columns is to avoid
> skew problems. However, in many cases especially when a window function is
> involved, it may make sense to only distribute on 1 column.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)