Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/10771 )
Change subject: IMPALA-110: Support for multiple DISTINCT ...................................................................... Patch Set 5: (29 comments) I addressed everything except the comment about additional functional tests, which will take some more time. http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h File be/src/exec/aggregation-node-base.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h@43 PS4, Line 43: /// AggregationNodeBase::SplitMiniBatches(). > const Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h File be/src/exec/aggregation-node.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h@50 PS4, Line 50: > optional: maybe something more explicit like curr_output_agg_idx_? Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc File be/src/exec/aggregation-node.cc: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@49 PS4, Line 49: child > maybe child_batch or input_batch to disambiguate from the minibatches Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@55 PS4, Line 55: make_unique<RowBatch>(child(0)->row_desc(), state->batch_size(), mem_tracker())); > It's a little unfortunate that there's no way in C++ apparently to forward Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@56 PS4, Line 56: } > nit: might be possible to make this more concise with something like: Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h File be/src/exec/aggregator.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@81 PS4, Line 81: > nit: add blank lines between methods with comments (existing code violated Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@93 PS4, Line 93: > This is a bit pedantic (so I'm ok if you ignore) but I think eos() and num_ Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@110 PS4, Line 110: /// The index of this Aggregator within the AggregationNode. When returning output, this > const? we're not that consistent about doing this in the code but I find it Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h File be/src/exec/grouping-aggregator.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@295 PS4, Line 295: he ca > "input batch" Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@296 PS4, Line 296: /// AddBatchStreaing() may already have rows passed through by another aggregator. > Might be worth mentioning that these are specifically required for the mult Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@298 PS4, Line 298: > Could we eliminate streaming_eos_ if we used streaming_idx_ == in_batch->nu Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc File be/src/exec/grouping-aggregator.cc: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc@118 PS4, Line 118: DCHECK_EQ(PARTITION_FANOUT, 1 << NUM_PARTITIONING_BITS); > consider initialising these in the header - i personally find it more reada Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h File be/src/exec/streaming-aggregation-node.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h@69 PS4, Line 69: bool child_batch_processed_ = true; > These need comments. Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc File be/src/exec/streaming-aggregation-node.cc: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@37 PS4, Line 37: DCHECK(tnode.conjuncts.empty()) << "Preaggs have no conjuncts"; > Same comment about initializing the constants in the header. Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@79 PS4, Line 79: if (aggregator_eos) ++curr_output_agg_idx_; > It might be worth adding logic like curr_agg_idx_ in AggregationNode to lin Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@79 PS4, Line 79: if (aggregator_eos) ++curr_output_agg_idx_; > *avoid linear search Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@91 PS4, Line 91: child_batch_.reset( > It looks like we're updating num_rows_returned_ twice Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@139 PS4, Line 139: continue; > This control flow might be clearer if it was converted into a for loop, i.e I'm not sure that its that much clearer, and the logic isn't quite the same - with the for loop, replicate_agg_idx_ won't be incremented if out_batch is AtCapacity() even if eos was hit (which happens a lot at least in the tests for this path I've written, when entire batches get streamed through), which can result in extra calls to GetNext(). So, not a big deal but I would prefer to leave it as is. http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@162 PS4, Line 162: DCHECK(eos); > Might help to have a concise explanation for the DCHECK Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc File be/src/exprs/valid-tuple-id.cc: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc@44 PS4, Line 44: int non_null_count = 0; > Consider removing the #ifndef by making the calculation a static helper use Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h File be/src/runtime/row-batch.h: http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@246 PS4, Line 246: /// division and so should not be used in hot inner loops. > Might be worth mentioning that this does an integer division and therefore Done http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@340 PS4, Line 340: > Maybe ClearTuplePointers()? Clear() makes me think it resets num_rows_ to 0 Done http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test File testdata/workloads/functional-planner/queries/PlannerTest/distinct.test: http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test@582 PS4, Line 582: lock > ? Done http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test File testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test: http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test@2 PS4, Line 2: select count(distinct tinyint_col) a, min(timestamp_col) b, > trailing whitespace Done http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test File testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test: http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@4 PS4, Line 4: alltypes > I think we should remove the functional. prefix, since the test framework s Done http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@37 PS4, Line 37: from alltypes group by bigint_col > I think we need an order by to make the results deterministic, here and for We automatically ignore the order of results for queries that don't contain an order by. http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@298 PS4, Line 298: > trailing whitespace Done http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@363 PS4, Line 363: > It looks like IMPALA-4042 was fixed a while back, does this still apply? Done http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@371 PS4, Line 371: class TestWideAggregationQueries(ImpalaTestSuite): > Same here Done -- To view, visit http://gerrit.cloudera.org:8080/10771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4 Gerrit-Change-Number: 10771 Gerrit-PatchSet: 5 Gerrit-Owner: Thomas Marshall <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Thomas Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Comment-Date: Fri, 17 Aug 2018 00:20:42 +0000 Gerrit-HasComments: Yes
