Tim Armstrong has posted comments on this change. Change subject: IMPALA-2581: LIMIT can be propagated down into some aggregations ......................................................................
Patch Set 1: (10 comments) Nice! I mainly focused on the backend, assuming Alex will also look at the FE changes. http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node-ir.cc File be/src/exec/partitioned-aggregation-node-ir.cc: Line 130: if (group_count_ < max_groups_) { We can turn this into a one-liner without an else. if (group_count_ >= max_groups_) return Status::OK(); ++group_count_ I was thinking about whether it's worth it to try and optimise out the branch with codegen. It'd be straightforward if you a have non-inlined function call has_max_groups() or even just max_groups() and use FindAndReplaceConstant(). I think the difference in perf is probably negligible right now since AddIntermediateTuple() isn't codegen'd, so I'm ok either way Line 184: if (group_count_ >= max_groups_ && aggregate_evaluators_.empty()) { I don't think there's any advantage to cross-compiling this code, so it'd be better to do this check in the caller. PS1, Line 185: child_eos_ This is overloading this variable. I'd rather keep the invariant that child_eos_ means the child returned eos. http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.cc File be/src/exec/partitioned-aggregation-node.cc: PS1, Line 330: group_count_ >= max_groups_ && aggregate_evaluators_.empty() I think it would be good to factor this out into an inline function, e.g. IsDoneProcessingInput() - or some better name. Will make it easier to follow the flow of the calling functions and avoid some duplication of logic. PS1, Line 331: eos I think this is overloading the meaning of this 'eos' variable. Maybe rename it to 'done' so the name doesn't imply that there's the end of some stream. Actually, it seems like it would make more sense to do this check after batch.Reset() below or before calling GetNext() above, then just breaking out of the loop. Otherwise we're fetching an extra batch that we're not processing. Line 573: child_batch_->Reset(); // All rows from child_batch_ were processed. I think we can move the check from ProcessBatchStreaming() to either after this Reset() or before GetNext() above, instead of setting child_eos_ to break out of this loop. http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.h File be/src/exec/partitioned-aggregation-node.h: Line 295: // ExecNode. Document that it's always a positive value, and set to the max value of int64 if there's no limit. http://gerrit.cloudera.org:8080/#/c/3822/1/fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java File fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java: Line 806: if (!(node.getAggInfo().getAggregateExprs().isEmpty() This needs a comment to explain the intent. Or, maybe better, factor this check and the one below out into a shared method with a meaningful name. PS1, Line 807: || || normally goes on the previous line. PS1, Line 911: !(node.getAggInfo().getAggregateExprs().isEmpty() : || node.getAggInfo().getGroupingExprs().isEmpty() Make this a method of AggInfo? -- To view, visit http://gerrit.cloudera.org:8080/3822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I59c5b7af7a73ccdbc5496b28eacb9b6859d202bc Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim Apple <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
