Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/10771 )
Change subject: IMPALA-110: Support for multiple DISTINCT
......................................................................

Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG@60
PS1, Line 60: - Ran hdfs/core tests
> Can we add some spilling tests too?
Done

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc@118
PS1, Line 118: if (aggs_.size() > 1) row_batch->Clear();
> I feel like there's a lot of unnecessary overhead in the pre-partitioning i
Unfortunately, we can't rely on the batches being strictly ordered by agg_idx: even though the previous agg node will have output its batches in agg_idx order, there may be a hash exchange between the agg nodes. So the only way to avoid minibatches would be to pass every batch into every aggregator and have each one check every row to see if it's for that aggregator. That might be better with small numbers of aggregators (though note there's a fast path above that skips minibatches when there's only a single aggregator), but it might not scale as well to a large number of aggregators.

I put together a prototype and ran some quick-and-dirty experiments on my local machine: for a table with 10m rows and 6 columns (all with ndv 10m), running a simple select with 12 distinct groupings, the average e2e running time was 123s without minibatches vs. about 115s with minibatches (with about 10s average time spent splitting the minibatches per fragment instance).

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregator.h
File be/src/exec/aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregator.h@77
PS1, Line 77: virtual void Close(RuntimeState* state);
> 'eos' signals that out_batch is the last batch to be returned?
> Does this av
Right, it signals whether child_batch was fully processed or if we ran out of space in out_batch. Added a comment.

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc
File be/src/exec/grouping-aggregator-ir.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@171
PS1, Line 171: const uint32_t hash = expr_vals_cache->CurExprValuesHash();
> Let's hoist the store to 'streaming_idx_' out of the loop.
Done

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@190
PS1, Line 190: streaming_idx_ = in_batch_iter.RowNum() + 1;
> This seems expensive to include in the inner loop. Can we avoid doing this
Done

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@191
PS1, Line 191:
> Let's hoist this load out of the loop too. Can we codegen this out for the
Done

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h@296
PS1, Line 296: /// is fully processed, 'streaming_eos_' will be true.
> Comments?
Done

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc@285
PS1, Line 285:
> This seems like unnecessary overhead.
Done

--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 07 Aug 2018 17:09:05 +0000
Gerrit-HasComments: Yes
