[Impala-ASF-CR] IMPALA-110: Support for multiple DISTINCT

Tim Armstrong (Code Review) Wed, 08 Aug 2018 17:12:56 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10771 )


Change subject: IMPALA-110: Support for multiple DISTINCT
......................................................................


Patch Set 4:

(29 comments)

I did a pass over the backend and the tests. It looks solid to me, comments are 
mainly aimed at making it more readable for people in the future and catching 
regressions.

I'd like to understand the fe part too.

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h
File be/src/exec/aggregation-node-base.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node-base.h@43
PS4, Line 43:   bool replicate_input_;
const


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h
File be/src/exec/aggregation-node.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.h@50
PS4, Line 50: curr_agg_idx_
optional: maybe something more explicit like curr_output_agg_idx_?


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@49
PS4, Line 49: batch
maybe child_batch or input_batch to disambiguate from the minibatches


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@55
PS4, Line 55:         new RowBatch(child(0)->row_desc(), state->batch_size(), 
mem_tracker()));
It's a little unfortunate that there's no way in C++ apparently to forward 
arguments to an array constructor, otherwise we could just use

unique_ptr<RowBatch[]>


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregation-node.cc@56
PS4, Line 56:     mini_batches.push_back(std::move(mini_batch));
nit: might be possible to make this more concise with something like:

  mini_batches.push_back(make_unique<RowBatch>(child(0)->row_desc(), 
state->batch_size(), mem_tracker())));


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h
File be/src/exec/aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@81
PS4, Line 81:   /// Used to insert input rows if this is a streaming pre-agg. 
Tries to aggregate all of
nit: add blank lines between methods with comments (existing code violated this 
style)


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@93
PS4, Line 93:   virtual bool eos() = 0;
This is a bit pedantic (so I'm ok if you ignore) but I think eos() and 
num_grouping_exprs() should be Eos() and GetNumGroupingExprs() according to the 
style guide since they're not really accessors: 
https://google.github.io/styleguide/cppguide.html#Function_Names

Or we could add a eos_ member and make this a true accessor.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/aggregator.h@110
PS4, Line 110:   int agg_idx_;
const? we're not that consistent about doing this in the code but I find it 
helpful to understand the lifetime of members.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@295
PS4, Line 295: input
"input batch"


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@296
PS4, Line 296:   /// is fully processed, 'streaming_eos_' will be true.
Might be worth mentioning that these are specifically required for the 
multi-aggregator case.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.h@298
PS4, Line 298:   bool streaming_eos_;
Could we eliminate streaming_eos_ if we used streaming_idx_ == 
in_batch->num_rows() to mean the same thing? It seems to fit.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/grouping-aggregator.cc@118
PS4, Line 118:     streaming_idx_(0),
consider initialising these in the header - i personally find it more readable.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h
File be/src/exec/streaming-aggregation-node.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.h@69
PS4, Line 69:   int32_t replicate_agg_idx_;
These need comments.

child_batch_eos_ is also potentially confusing. Wonder if there's a better 
name, e.g. 'child_batch_processing_done_' or something more elegant than that.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc
File be/src/exec/streaming-aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@37
PS4, Line 37:     child_eos_(false),
Same comment about initializing the constants in the header.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@79
PS4, Line 79:     for (int i = 0; i < aggs_.size(); ++i) {
It might be worth adding logic like curr_agg_idx_ in AggregationNode to linear 
search here. I'd expect the number of aggs to be small in the common case but 
we definitely see extreme queries sometimes so it can pay to be a little 
defensive.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@91
PS4, Line 91:   COUNTER_SET(rows_returned_counter_, num_rows_returned_);
It looks like we're updating num_rows_returned_ twice


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@139
PS4, Line 139:         DCHECK(eos);
This control flow might be clearer if it was converted into a for loop, i.e.

      for (;replicate_agg_idx_ < aggs_.size(); ++replicate_agg_idx_) {
        RETURN_IF_ERROR(aggs_[replicate_agg_idx_]->AddBatchStreaming(
            state, out_batch, child_batch_.get(), &eos));
        if (out_batch->AtCapacity()) break;
        DCHECK(eos);
      }


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exec/streaming-aggregation-node.cc@162
PS4, Line 162:           DCHECK(eos);
Might help to have a concise explanation for the DCHECK


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc
File be/src/exprs/valid-tuple-id.cc:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/exprs/valid-tuple-id.cc@44
PS4, Line 44: #ifndef NDEBUG
Consider removing the #ifndef by making the calculation a static helper used in 
the DCHECK, .e.g


  DCHECK_EQ(1, ComputeNonNullCount(row, tuple_idx_.size()))


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h
File be/src/runtime/row-batch.h:

http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@246
PS4, Line 246:     int RowNum() { return (row_ - parent_->tuple_ptrs_) / 
num_tuples_per_row_; }
Might be worth mentioning that this does an integer division and therefore 
shouldn't be use in hot inner loops.


http://gerrit.cloudera.org:8080/#/c/10771/4/be/src/runtime/row-batch.h@340
PS4, Line 340: Clear
Maybe ClearTuplePointers()? Clear() makes me think it resets num_rows_ to 0


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test
File testdata/workloads/functional-planner/queries/PlannerTest/distinct.test:

http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test@582
PS4, Line 582: ABCD
?


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test:

http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test@2
PS4, Line 2: select count(distinct tinyint_col) a, min(timestamp_col) b,
trailing whitespace


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test
File 
testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test:

PS4:
I should probably think about the testing matrix further. The random query 
generator helps a lot here, but I can think of a couple of cases where we might 
want a basic end-to-end functional test:

* Slightly more complex plans, e.g multiple-distinct aggregation in a subquery
* Slightly more complex expressions, e.g. count(distinct ) with a non-trivial 
input expression and count(distinct ) inside a non-trivial expression
* Subplan support - running a multi-agg inside a subplan
* Handling of nulls (alltypes doesn't have nulls).

I think mostly the current implementation is clean and orthogonal enough that 
existing testing provides code coverage, so these would be more regression 
tests.


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@4
PS4, Line 4: functional.
I think we should remove the functional. prefix, since the test framework 
should be doing the use <db>;


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@37
PS4, Line 37: from functional.alltypes group by bigint_col
I think we need an order by to make the results deterministic, here and for 
some other of the below queries, unless I'm missing something?


http://gerrit.cloudera.org:8080/#/c/10771/4/testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test@298
PS4, Line 298:
trailing whitespace


http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@363
PS4, Line 363:       pytest.xfail("IMPALA-4042: count(distinct NULL) fails on a 
view, needed for kudu")
It looks like IMPALA-4042 was fixed a while back, does this still apply?


http://gerrit.cloudera.org:8080/#/c/10771/4/tests/query_test/test_aggregation.py@371
PS4, Line 371:       pytest.xfail("IMPALA-4042: count(distinct NULL) fails on a 
view, needed for kudu")
Same here



--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Thu, 09 Aug 2018 00:12:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-110: Support for multiple DISTINCT

Reply via email to