[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Sahil Takiar (Code Review) Tue, 16 Jun 2020 14:42:21 -0700

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )


Change subject: IMPALA-2658: Extend the NDV function to accept a precision
......................................................................


Patch Set 34:

(16 comments)

haven't read through all the updates, but overall looking a lot better

http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@13
PS27, Line 13:    NDV([DISTINCT | ALL] expression [, scale])
             :
> Combined the two into one form.
The new version looks good now. yeah 'scale' sounds good to me.


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@78
PS27, Line 78:    into a single column in an external Impala table.  It was 
found that the
             :    total execution time was relatively the same across different 
scales for
             :    a given table configuration. It remains to be seen the 
execution time for
             :    tables involving multiple data files across multiple nodes.
             :
             :    - 10 million unique string file: ~3.5s
> I don't think you really need to list these out, you don't list out the err
ping


http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@16
PS34, Line 16:
nit: extra white space


http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@35
PS34, Line 35: The enhancement involves both the front and the back end.
             :
             : In the frond end, a 2nd parameter in NDV() is allowed and 
verified.
             : In addition, the data type of the intermediate result in the
             : plan records the correct amount of memory needed. This is 
assisted
             : by the inclusion of additional template aggregate function 
objects
             : in the built-in database.
             :
             : In the back end, the current hardcoded precision of 10 is 
removed. The
             : HLL algorithm now works with the default, or any valid precision 
values.
             : The precision value is computed from the corresponding scale 
value
             : stored in the query plan.
I think u can remove this whole section.


http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@56
PS34, Line 56: Performance
would be good to do some quick sanity checks to make sure ndv performance did 
not regress. essentially comparing ndv performance without this patch vs ndv 
performance with this patch.


http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@59
PS34, Line 59:     - 5 sets with 10 million unique strings
             :     - 5 sets with 10 million unique integers
             :     - 5 sets with 100 million unique strings
             :     - 5 sets with 97 million unique integers
             :     - 1 set with 499 million unique strings
             :     - 1 set with 450 million unique integers
Remove this, its too verbose. You paste a link to the google doc containing all 
the experiments you did on the JIRA directly.


http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@83
PS34, Line 83:    - 10 million unique string file: ~3.5s
             :    - 10 million unique integer file: ~3.34s
             :    - 100 million unique string file: ~5.0s
             :    - 97 million unique integer file: ~5.0s
             :    - 499 million unique string file: ~22.0s
             :    - 450 million unique integer file: ~19.0s
Remove this, its too verbose. You paste a link to the google doc containing all 
the experiments you did on the JIRA directly.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1576
PS27, Line 1576:
> Reworded to
okay - the part that confuses me is the wording about 'estimate' since it seems 
to contradict the code below:

 int64_t estimate = alpha * hll_len * hll_len * harmonic_mean;


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1442
PS34, Line 1442: static int ComputePrecisionFromScale(int scale) {
this should probably be marked as an 'inline' function to avoid the method call 
overhead.


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1457
PS34, Line 1457: HLL
it seems the pattern in the rest of this class is to use 'Hll' instead of 'HLL' 
so maybe for this class just stick to 'Hll' to be consistent


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1488
PS34, Line 1488: const int
can be a constant reference; so 'const int&' instead of 'const int'


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1523
PS34, Line 1523:   const int hll_len = dst->len;
same comment as above


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1556
PS34, Line 1556:
nit: extraneous newline


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1611
PS34, Line 1611:
nit: extraneous new line


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1613
PS34, Line 1613:
nit: extraneous new line


http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@2486
PS34, Line 2486: Imp
what is Imp represent? implementation? I think it would be clearer if the 'Imp' 
part of the method name was removed, and we just relied on method overloading.



--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 34
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Tue, 16 Jun 2020 21:41:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Reply via email to