Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10771 )

Change subject: IMPALA-110: Support for multiple DISTINCT
......................................................................


Patch Set 5:

(29 comments)

I addressed everything except the comment about additional functional tests, 
which will take some more time.

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h
File be/src/exec/aggregation-node-base.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h@43
PS4, Line 43:   /// AggregationNodeBase::SplitMiniBatches().
> const
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h
File be/src/exec/aggregation-node.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h@50
PS4, Line 50:
> optional: maybe something more explicit like curr_output_agg_idx_?
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@49
PS4, Line 49: child
> maybe child_batch or input_batch to disambiguate from the minibatches
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@55
PS4, Line 55:         make_unique<RowBatch>(child(0)->row_desc(), 
state->batch_size(), mem_tracker()));
> It's a little unfortunate that there's no way in C++ apparently to forward
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@56
PS4, Line 56:   }
> nit: might be possible to make this more concise with something like:
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h
File be/src/exec/aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@81
PS4, Line 81:
> nit: add blank lines between methods with comments (existing code violated
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@93
PS4, Line 93: 
> This is a bit pedantic (so I'm ok if you ignore) but I think eos() and num_
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@110
PS4, Line 110:   /// The index of this Aggregator within the AggregationNode. 
When returning output, this
> const? we're not that consistent about doing this in the code but I find it
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@295
PS4, Line 295: he ca
> "input batch"
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@296
PS4, Line 296:   /// AddBatchStreaing() may already have rows passed through by 
another aggregator.
> Might be worth mentioning that these are specifically required for the mult
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@298
PS4, Line 298:
> Could we eliminate streaming_eos_ if we used streaming_idx_ == in_batch->nu
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc@118
PS4, Line 118:   DCHECK_EQ(PARTITION_FANOUT, 1 << NUM_PARTITIONING_BITS);
> consider initialising these in the header - i personally find it more reada
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h
File be/src/exec/streaming-aggregation-node.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h@69
PS4, Line 69:   bool child_batch_processed_ = true;
> These need comments.
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc
File be/src/exec/streaming-aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@37
PS4, Line 37:   DCHECK(tnode.conjuncts.empty()) << "Preaggs have no conjuncts";
> Same comment about initializing the constants in the header.
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@79
PS4, Line 79:     if (aggregator_eos) ++curr_output_agg_idx_;
> It might be worth adding logic like curr_agg_idx_ in AggregationNode to lin
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@79
PS4, Line 79:     if (aggregator_eos) ++curr_output_agg_idx_;
> *avoid linear search
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@91
PS4, Line 91:     child_batch_.reset(
> It looks like we're updating num_rows_returned_ twice
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@139
PS4, Line 139:       continue;
> This control flow might be clearer if it was converted into a for loop, i.e
I'm not sure that its that much clearer, and the logic isn't quite the same - 
with the for loop, replicate_agg_idx_ won't be incremented if out_batch is 
AtCapacity() even if eos was hit (which happens a lot at least in the tests for 
this path I've written, when entire batches get streamed through), which can 
result in extra calls to GetNext(). So, not a big deal but I would prefer to 
leave it as is.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@162
PS4, Line 162:           DCHECK(eos);
> Might help to have a concise explanation for the DCHECK
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc
File be/src/exprs/valid-tuple-id.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc@44
PS4, Line 44:   int non_null_count = 0;
> Consider removing the #ifndef by making the calculation a static helper use
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h
File be/src/runtime/row-batch.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@246
PS4, Line 246:     /// division and so should not be used in hot inner loops.
> Might be worth mentioning that this does an integer division and therefore
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@340
PS4, Line 340:
> Maybe ClearTuplePointers()? Clear() makes me think it resets num_rows_ to 0
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test
File testdata/workloads/functional-planner/queries/PlannerTest/distinct.test:

http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test@582
PS4, Line 582: lock
> ?
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test:

http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test@2
PS4, Line 2: select count(distinct tinyint_col) a, min(timestamp_col) b,
> trailing whitespace
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test
File 
testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test:

http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@4
PS4, Line 4: alltypes
> I think we should remove the functional. prefix, since the test framework s
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@37
PS4, Line 37: from alltypes group by bigint_col
> I think we need an order by to make the results deterministic, here and for
We automatically ignore the order of results for queries that don't contain an 
order by.


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@298
PS4, Line 298:
> trailing whitespace
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@363
PS4, Line 363:
> It looks like IMPALA-4042 was fixed a while back, does this still apply?
Done


http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@371
PS4, Line 371: class TestWideAggregationQueries(ImpalaTestSuite):
> Same here
Done



--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Fri, 17 Aug 2018 00:20:42 +0000
Gerrit-HasComments: Yes

Reply via email to