Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2581: LIMIT can be propagated down into some aggregations
......................................................................


Patch Set 1:

(4 comments)

I thought some more about this during the day and I think there are some subtle
issues around spilling and passthrough aggs with limits that we don't have
test coverage for.

http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node-ir.cc
File be/src/exec/partitioned-aggregation-node-ir.cc:

Line 130:   if (group_count_ < max_groups_) {
> We can turn this into a one-liner without an else.
This will also give incorrect results for spilling aggregations. If
AGGREGATED_ROWS is true, this is processing a spilled row that was already
aggregated, so it will double-count that group.

I think we should add a test to test_spilling.py that exercises this code path 
(e.g. a large agg with a high limit).
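
To make the double-counting concern concrete, here is a rough toy model, not
the actual Impala code. GroupCounter, ProcessRow, Spill and the string-keyed
set are hypothetical stand-ins. Only the names group_count, max_groups and
AGGREGATED_ROWS mirror the patch. It assumes a group is counted once, when it
is first created from unaggregated input, and that re-processing a spilled,
already-aggregated row must not count it again.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical toy model of group counting under a limit.
struct GroupCounter {
  explicit GroupCounter(int64_t max) : max_groups(max), group_count(0) {}

  int64_t max_groups;
  int64_t group_count;
  std::unordered_set<std::string> table;  // stand-in for the hash table

  // Returns true if the row's group is (now) in the table.
  template <bool AGGREGATED_ROWS>
  bool ProcessRow(const std::string& key) {
    // Existing group: merge into it, nothing to count.
    if (table.count(key) > 0) return true;
    if (AGGREGATED_ROWS) {
      // Spilled row: its group was counted when it was first created,
      // so re-insert it without touching group_count.
      table.insert(key);
      return true;
    }
    // New group from unaggregated input: enforce the limit.
    if (group_count >= max_groups) return false;
    table.insert(key);
    ++group_count;
    return true;
  }

  // Simulates spilling: groups leave memory but stay counted.
  void Spill() { table.clear(); }
};

// Creates two groups, spills them, then re-processes them as aggregated rows.
int64_t GroupCountAfterSpillAndReload() {
  GroupCounter c(3);
  c.ProcessRow<false>("a");
  c.ProcessRow<false>("b");
  c.Spill();
  c.ProcessRow<true>("a");
  c.ProcessRow<true>("b");
  return c.group_count;
}
```

In this model the count stays at 2 after the spilled rows are re-processed.
Incrementing unconditionally in the AGGREGATED_ROWS branch would report 4.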


Line 250:     if (group_count_ >= max_groups_) {
I think this should actually occur before the remaining_capacity check.
Otherwise we'll pass through rows even when we already have enough groups 
(meaning the limit isn't correctly enforced and we do unnecessary work).
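
The ordering I have in mind could be sketched roughly as follows. This is an
illustration only: the Action enum and DecideNewGroup are hypothetical names,
and it assumes that a row belonging to a new group can simply be dropped once
enough groups exist, which needs to be confirmed against the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcomes for a row that does not match an existing group.
enum class Action { kInsert, kPassThrough, kDrop };

// Decides what to do with a row for a new group. The limit check comes
// first, so a full hash table cannot cause extra rows to be passed through
// after enough groups have already been produced.
Action DecideNewGroup(int64_t group_count, int64_t max_groups,
                      int64_t remaining_capacity) {
  if (group_count >= max_groups) return Action::kDrop;
  if (remaining_capacity <= 0) return Action::kPassThrough;
  return Action::kInsert;
}
```

With the checks in the opposite order, the case of a full table with the limit
already reached would yield a passthrough instead of a drop.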


http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.cc
File be/src/exec/partitioned-aggregation-node.cc:

PS1, Line 506: ReachedLimit
Do we actually need these limit checks for grouping aggregations now? If not, 
we should remove the redundant checks and document how the limit is enforced.


Line 519: Status PartitionedAggregationNode::GetRowsStreaming(RuntimeState* state,
We don't actually implement limit checks in GetRowsStreaming(), since 
previously limits were never applied.

I think in some circumstances (e.g. if it switches into streaming mode) the 
current version of the patch will return more rows than the limit. I think if 
you fix my other comment in -ir.cc this may not be the case, but it would be 
good to think through and document why the explicit limit check isn't necessary 
(if it indeed isn't).
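
If an explicit check does turn out to be necessary, it might look something
like the sketch below. RowsToKeep is a hypothetical helper, not an existing
function, and the convention that a limit of -1 means "no limit" is an
assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Returns how many rows of an output batch may be kept so that the total
// number of rows returned never exceeds the limit.
// Assumption: limit == -1 means that no limit applies.
int64_t RowsToKeep(int64_t rows_returned, int64_t limit, int64_t batch_rows) {
  if (limit < 0) return batch_rows;
  int64_t remaining = std::max<int64_t>(0, limit - rows_returned);
  return std::min(remaining, batch_rows);
}
```

For example, with 8 rows already returned and a limit of 10, a 5-row batch
would be truncated to 2 rows.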


-- 
To view, visit http://gerrit.cloudera.org:8080/3822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I59c5b7af7a73ccdbc5496b28eacb9b6859d202bc
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
