Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15997 )
Change subject: IMPALA-2658: Extend the NDV function to accept a precision ...................................................................... Patch Set 34: (16 comments) haven't read through all the updates, but overall looking a lot better http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@13 PS27, Line 13: NDV([DISTINCT | ALL] expression [, scale]) : > Combined the two into one form. The new version looks good now. yeah 'scale' sounds good to me. http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@78 PS27, Line 78: into a single column in an external Impala table. It was found that the : total execution time was relatively the same across different scales for : a given table configuration. It remains to be seen the execution time for : tables involving multiple data files across multiple nodes. : : - 10 million unique string file: ~3.5s > I don't think you really need to list these out, you don't list out the err ping http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@16 PS34, Line 16: nit: extra white space http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@35 PS34, Line 35: The enhancement involves both the front and the back end. : : In the frond end, a 2nd parameter in NDV() is allowed and verified. : In addition, the data type of the intermediate result in the : plan records the correct amount of memory needed. This is assisted : by the inclusion of additional template aggregate function objects : in the built-in database. : : In the back end, the current hardcoded precision of 10 is removed. The : HLL algorithm now works with the default, or any valid precision values. : The precision value is computed from the corresponding scale value : stored in the query plan. I think u can remove this whole section. http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@56 PS34, Line 56: Performance would be good to do some quick sanity checks to make sure ndv performance did not regress. essentially comparing ndv performance without this patch vs ndv performance with this patch. http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@59 PS34, Line 59: - 5 sets with 10 million unique strings : - 5 sets with 10 million unique integers : - 5 sets with 100 million unique strings : - 5 sets with 97 million unique integers : - 1 set with 499 million unique strings : - 1 set with 450 million unique integers Remove this, its too verbose. You paste a link to the google doc containing all the experiments you did on the JIRA directly. http://gerrit.cloudera.org:8080/#/c/15997/34//COMMIT_MSG@83 PS34, Line 83: - 10 million unique string file: ~3.5s : - 10 million unique integer file: ~3.34s : - 100 million unique string file: ~5.0s : - 97 million unique integer file: ~5.0s : - 499 million unique string file: ~22.0s : - 450 million unique integer file: ~19.0s Remove this, its too verbose. You paste a link to the google doc containing all the experiments you did on the JIRA directly. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1576 PS27, Line 1576: > Reworded to okay - the part that confuses me is the wording about 'estimate' since it seems to contradict the code below: int64_t estimate = alpha * hll_len * hll_len * harmonic_mean; http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1442 PS34, Line 1442: static int ComputePrecisionFromScale(int scale) { this should probably be marked as an 'inline' function to avoid the method call overhead. http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1457 PS34, Line 1457: HLL it seems the pattern in the rest of this class is to use 'Hll' instead of 'HLL' so maybe for this class just stick to 'Hll' to be consistent http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1488 PS34, Line 1488: const int can be a constant reference; so 'const int&' instead of 'const int' http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1523 PS34, Line 1523: const int hll_len = dst->len; same comment as above http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1556 PS34, Line 1556: nit: extraneous newline http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1611 PS34, Line 1611: nit: extraneous new line http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@1613 PS34, Line 1613: nit: extraneous new line http://gerrit.cloudera.org:8080/#/c/15997/34/be/src/exprs/aggregate-functions-ir.cc@2486 PS34, Line 2486: Imp what is Imp represent? implementation? I think it would be clearer if the 'Imp' part of the method name was removed, and we just relied on method overloading. -- To view, visit http://gerrit.cloudera.org:8080/15997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 Gerrit-Change-Number: 15997 Gerrit-PatchSet: 34 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Tue, 16 Jun 2020 21:41:43 +0000 Gerrit-HasComments: Yes
