Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10771 )

Change subject: IMPALA-110: Support for multiple DISTINCT
......................................................................


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG@60
PS1, Line 60: - Ran hdfs/core tests
> Can we add some spilling tests too?
Done


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc@118
PS1, Line 118:   if (aggs_.size() > 1) row_batch->Clear();
> I feel like there's a lot of unnecessary overhead in the pre-partitioning i
Unfortunately, we can't rely on the batches being strictly ordered by agg_idx 
(even though the previous agg node will have output the batches in agg_idx 
order, there may be a hash exchange between the agg nodes), so I think the only 
way to avoid minibatches would be to pass every batch into every aggregator and 
have each one check every row to see if it's for that aggregator.

That might be better with small numbers of aggregators (though of course note 
there's a fast path above that skips minibatches when there's only a single 
aggregator), but it might not scale as well to a large number of aggregators.

I put together a prototype and did some quick-and-dirty experiments on my local 
machine: for a table with 10m rows and 6 columns (all with NDV 10m), running a 
simple select with 12 distinct groupings, average e2e running time was 123s 
without minibatches vs. about 115s with minibatches (with about 10s of average 
time spent splitting the minibatches per fragment instance).


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregator.h
File be/src/exec/aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregator.h@77
PS1, Line 77:   virtual void Close(RuntimeState* state);
> 'eos' signals that out_batch is the last batch to be returned? Does this av
Right, it signals whether child_batch was fully processed or if we ran out of 
space in out_batch. Added a comment.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc
File be/src/exec/grouping-aggregator-ir.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@171
PS1, Line 171:       const uint32_t hash = expr_vals_cache->CurExprValuesHash();
> Let's hoist the store to 'streaming_idx_' out of the loop.
Done


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@190
PS1, Line 190:           streaming_idx_ = in_batch_iter.RowNum() + 1;
> This seems expensive to include in the inner loop. Can we avoid doing this
Done


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@191
PS1, Line 191:
> Let's hoist this load out of the loop too. Can we codegen this out for the
Done


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h@296
PS1, Line 296:   /// is fully processed, 'streaming_eos_' will be true.
> Comments?
Done


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc@285
PS1, Line 285:
> This seems like unnecessary overhead.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 07 Aug 2018 17:09:05 +0000
Gerrit-HasComments: Yes
