[ 
https://issues.apache.org/jira/browse/DRILL-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-3352:
------------------------------
    Labels: window_function  (was: )

> Extra re-distribution when evaluating window function after GROUP BY
> --------------------------------------------------------------------
>
>                 Key: DRILL-3352
>                 URL: https://issues.apache.org/jira/browse/DRILL-3352
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.0.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>              Labels: window_function
>             Fix For: 1.2.0
>
>
> Consider the following query and plan: 
> {code}
> explain plan for select min(l_partkey) over (partition by l_suppkey) from 
> lineitem group by l_partkey, l_suppkey limit 1;
> 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[1])
> 00-04            UnionExchange
> 01-01              Project($0=[$3])
> 01-02                Window(window#0=[window(partition {1} order by [] range 
> between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [MIN($0)])])
> 01-03                  SelectionVectorRemover
> 01-04                    Sort(sort0=[$1], dir0=[ASC])
> 01-05                      Project(l_partkey=[$0], l_suppkey=[$1], $f2=[$2])
> 01-06                        HashToRandomExchange(dist0=[[$1]])
> 02-01                          UnorderedMuxExchange
> 03-01                            Project(l_partkey=[$0], l_suppkey=[$1], 
> $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
> 03-02                              HashAgg(group=[{0, 1}], agg#0=[MIN($2)])
> 03-03                                Project(l_partkey=[$0], l_suppkey=[$1], 
> $f2=[$2])
> 03-04                                  HashToRandomExchange(dist0=[[$0]], 
> dist1=[[$1]])
> 04-01                                    UnorderedMuxExchange
> 05-01                                      Project(l_partkey=[$0], 
> l_suppkey=[$1], $f2=[$2], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, hash64AsDouble($0)))])
> 05-02                                        HashAgg(group=[{0, 1}], 
> agg#0=[MIN($0)])
> 05-03                                          Project(l_partkey=[$1], 
> l_suppkey=[$0])
> 05-04                                            
> Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=file:/Users/asinha/data/tpch-sf1/lineitem]], 
> selectionRoot=/Users/asinha/data/tpch-sf1/lineitem, numFiles=1, 
> columns=[`l_partkey`, `l_suppkey`]]])
> {code}
> Here, we do a distribution for the HashAgg on 2 columns: {l_partkey, 
> l_suppkey}.  Subsequently, we re-distribute on {l_suppkey} only since the 
> window function has a partition-by l_suppkey.  The second re-distribute could 
> be avoided if the first distribution for the HashAgg was done on l_suppkey 
> only.   The reason we do distribution on all grouping columns is to avoid 
> skew problems.   However, in many cases especially when a window function is 
> involved, it may make sense to only distribute on 1 column. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to