Alex Behm has posted comments on this change.

Change subject: IMPALA-4794: Partition distinct expr for skew data

Patch Set 1:

Commit Message:

Line 7: IMPALA-4794: Partition distinct expr for skew data
Summary msg is not very clear to me, e.g. "for data skew" should be "handling 
data skew". I suggest something like:

IMPALA-4794: Grouping distinct agg plan robust to data skew

Line 9: Currently in an aggregation with grouping and distinct expr, there
Needs cleanup for grammar and wording. Writing crisp prose is very important 
for readers.

Here are some pointers. The first sentence should describe the problem and 
high-level solution. For example:

This patch changes the query plan for grouping distinct aggregations to be more 
robust to data skew in the grouping expressions.

The existing plan ... etc.

The new plan ... etc.

Line 19: Testing: In planner test, some plans with distinct aggregation change.
Testing: Modified existing planner tests which already provide sufficient 
File fe/src/main/java/org/apache/impala/planner/

Line 861:   private PlanFragment 
createPhase2DistinctAggregationFragment(AggregationNode node,
please rename 'node' to 'phase2AggNode' for clarity

Line 864:     ArrayList<Expr> groupingExprs = 
Clean up: these groupingExprs are not needed anymore, and neither is the 
hasGrouping variable. Remove these and directly check 
node.getAggInfo().getGroupingExprs().isEmpty() in L924

Line 867:     // The first-phase aggregation node is already in the child 
first-phase -> phase-1

Line 871:     List<Expr> partitionExprs = null;
combine with L883

Line 875:     // - merge fragment, hash-partitioned on grouping and distinct 
first merge fragment

Line 876:     //   * merge agg of phase 1
let's use "phase-N" consistently (with hyphen)

Line 878:     // - merge fragment 2, partitioned on grouping exprs or 
unpartitioned with no grouping
second merge fragment

Line 887:     PlanFragment mergeFragment = null;
use firstMergeFragment and secondMergeFragment variables to make the 
distinction clearer

Line 917:     if (mergeFragment != childFragment) fragments.add(mergeFragment);
move at the end of the the else block in L895

Line 920:     // Any limit should be placed in the final merge aggregation node
I think it's clearer to cluster the code for limit setting/unsetting together, 
e.g. in L939 you can do something like:

// Limit should be applied at the final merge aggregation node

Line 922:     node.unsetLimit();
If we have grouping, we should make the phase2AggNode streaming like in L899

Line 925:       // place the merge aggregation of the 2nd phase in an 
unpartitioned fragment;
remove comment, doesn't add any value

Line 929:         node.getAggInfo().getMergeAggInfo().getGroupingExprs());
indent 4

Line 931:     // add preceding merge fragment at end
not sure what this comment means, I don't think it makes sense after your 

Line 934:       node.getAggInfo().getMergeAggInfo());
indent 4

Line 936:     // Transfer having predicates. If hasGrouping == true, the 
predicates should
adjust comment to reflect the new code

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I7bdada0e328b555900c7b7ff8aabc8eb15ae8fa9
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-HasComments: Yes

Reply via email to