Alex Behm has posted comments on this change. Change subject: IMPALA-4794: Partition distinct expr for skew data ......................................................................
Patch Set 1: (19 comments) http://gerrit.cloudera.org:8080/#/c/7643/1//COMMIT_MSG Commit Message: Line 7: IMPALA-4794: Partition distinct expr for skew data Summary msg is not very clear to me, e.g. "for data skew" should be "handling data skew". I suggest something like: IMPALA-4794: Grouping distinct agg plan robust to data skew Line 9: Currently in an aggregation with grouping and distinct expr, there Needs cleanup for grammar and wording. Writing crisp prose is very important for readers. Here are some pointers. The first sentence should describe the problem and high-level solution. For example: This patch changes the query plan for grouping distinct aggregations to be more robust to data skew in the grouping expressions. The existing plan ... etc. The new plan ... etc. Line 19: Testing: In planner test, some plans with distinct aggregation change. Testing: Modified existing planner tests which already provide sufficient coverage. http://gerrit.cloudera.org:8080/#/c/7643/1/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: Line 861: private PlanFragment createPhase2DistinctAggregationFragment(AggregationNode node, please rename 'node' to 'phase2AggNode' for clarity Line 864: ArrayList<Expr> groupingExprs = node.getAggInfo().getGroupingExprs(); Clean up: these groupingExprs are not needed anymore, and neither is the hasGrouping variable. Remove these and directly check node.getAggInfo().getGroupingExprs().isEmpty() in L924 Line 867: // The first-phase aggregation node is already in the child fragment. first-phase -> phase-1 Line 871: List<Expr> partitionExprs = null; combine with L883 Line 875: // - merge fragment, hash-partitioned on grouping and distinct exprs: first merge fragment Line 876: // * merge agg of phase 1 let's use "phase-N" consistently (with hyphen) Line 878: // - merge fragment 2, partitioned on grouping exprs or unpartitioned with no grouping second merge fragment Line 887: PlanFragment mergeFragment = null; use firstMergeFragment and secondMergeFragment variables to make the distinction clearer Line 917: if (mergeFragment != childFragment) fragments.add(mergeFragment); move at the end of the the else block in L895 Line 920: // Any limit should be placed in the final merge aggregation node I think it's clearer to cluster the code for limit setting/unsetting together, e.g. in L939 you can do something like: // Limit should be applied at the final merge aggregation node phase2MergeAggNode.setLimit(node.getLimit()); node.unsetLimit(); Line 922: node.unsetLimit(); If we have grouping, we should make the phase2AggNode streaming like in L899 Line 925: // place the merge aggregation of the 2nd phase in an unpartitioned fragment; remove comment, doesn't add any value Line 929: node.getAggInfo().getMergeAggInfo().getGroupingExprs()); indent 4 Line 931: // add preceding merge fragment at end not sure what this comment means, I don't think it makes sense after your changes Line 934: node.getAggInfo().getMergeAggInfo()); indent 4 Line 936: // Transfer having predicates. If hasGrouping == true, the predicates should adjust comment to reflect the new code -- To view, visit http://gerrit.cloudera.org:8080/7643 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7bdada0e328b555900c7b7ff8aabc8eb15ae8fa9 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-HasComments: Yes
