Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2581: LIMIT can be propagated down into some aggregations
......................................................................


Patch Set 1:

(10 comments)

Nice! I mainly focused on the backend, assuming Alex will also look at the FE 
changes.

http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node-ir.cc
File be/src/exec/partitioned-aggregation-node-ir.cc:

Line 130:   if (group_count_ < max_groups_) {
We can turn this into a one-liner without an else.

if (group_count_ >= max_groups_) return Status::OK();
++group_count_

I was thinking about whether it's worth it to try and optimise out the branch 
with codegen. It'd be straightforward if you a have non-inlined function call 
has_max_groups() or even just max_groups() and use FindAndReplaceConstant(). I 
think the difference in perf is probably negligible right now since 
AddIntermediateTuple() isn't codegen'd, so I'm ok either way


Line 184:   if (group_count_ >= max_groups_ && aggregate_evaluators_.empty()) {
I don't think there's any advantage to cross-compiling this code, so it'd be 
better to do this check in the caller.


PS1, Line 185: child_eos_
This is overloading this variable. I'd rather keep the invariant that 
child_eos_ means the child returned eos.


http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.cc
File be/src/exec/partitioned-aggregation-node.cc:

PS1, Line 330: group_count_ >= max_groups_ && aggregate_evaluators_.empty()
I think it would be good to factor this out into an inline function, e.g. 
IsDoneProcessingInput() - or some better name. Will make it easier to follow 
the flow of the calling functions and avoid some duplication of logic.


PS1, Line 331: eos
I think this is overloading the meaning of this 'eos' variable. Maybe rename it 
to 'done' so the name doesn't imply that there's the end of some stream.

Actually, it seems like it would make more sense to do this check after 
batch.Reset() below or before calling GetNext() above, then just breaking out 
of the loop. Otherwise we're fetching an extra batch that we're not processing.


Line 573:     child_batch_->Reset(); // All rows from child_batch_ were 
processed.
I think we can move the check from ProcessBatchStreaming() to either after this 
Reset() or before GetNext() above, instead of setting child_eos_ to break out 
of this loop.


http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.h
File be/src/exec/partitioned-aggregation-node.h:

Line 295:   // ExecNode.
Document that it's always a positive value, and set to the max value of int64 
if there's no limit.


http://gerrit.cloudera.org:8080/#/c/3822/1/fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java
File fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java:

Line 806:     if (!(node.getAggInfo().getAggregateExprs().isEmpty()
This needs a comment to explain the intent. Or, maybe better, factor this check 
and the one below out into a shared method with a meaningful name.


PS1, Line 807:       || 
|| normally goes on the previous line.


PS1, Line 911: !(node.getAggInfo().getAggregateExprs().isEmpty()
             :               || node.getAggInfo().getGroupingExprs().isEmpty()
Make this a method of AggInfo?


-- 
To view, visit http://gerrit.cloudera.org:8080/3822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I59c5b7af7a73ccdbc5496b28eacb9b6859d202bc
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to