Qifan Chen has uploaded a new patch set (#40). ( http://gerrit.cloudera.org:8080/15997 )
Change subject: IMPALA-2658: Extend the NDV function to accept a precision ...................................................................... IMPALA-2658: Extend the NDV function to accept a precision This work addresses the current limitation in NDV function by extending the function to optionally take a secondary argument called scale. NDV([DISTINCT | ALL] expression [, scale]) Without the secondary argument, all the syntax and semantics are preserved. The precision, which determines the total number of different estimators in the HLL algorithm, is still 10. When supplied, the scale argument must be an interger literal in the range from 1 to 10. Its value is internally mapped to a precision used by the HLL algorithm, with the following mapping formula: precision = scale + 8. Thus, a scale of 1 is mapped to a precision of 9 and a scale of 10 is mapped to a precision of 18. A large precision value generally produces a better estimation (i.e. with less error) than a small precision value, due to extra number of estimators involved. The expense is at the extra amount of memory needed. For a given precision p, the amount of memory used by the HLL algorithm is in the order of 2^p bytes. Testing: 1. Ran unit tests against table store_sales in TPC-DS and table customer in TPCH in both serial and parallel plan settings; 2. Added and ran a new regression test (test_ndv)) in TestAggregationQueries section to compute NDV() for every supported Impala data type over all valid scale values; 3. Ran "core" tests. Performance: 1. Ran estimation error tests against a total of 22 distinct data sets loaded into external Impala tables. The error was computed as abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>. Overall, the precision of 18 (or the scale value of 10) gave the best result with worst estimation error at 0.42% (for one set of 10 million integers), and average error no more than 0.17%, at the cost of 256Kb of memory for the internal data structure per evaluation of the HLL algorithm. Other precisions (such as 16 and 17) were also very reasonable but with slightly larger estimation errors. 2. Ran execution time tests against a total of 6 distinct data files on a single node EC2 VM in debug mode. These data files were loaded in turn into a single column in an external Impala table. It was found that the total execution time was relatively the same across different scales for a given table configuration. It remains to be seen the execution time for tables involving multiple data files across multiple nodes. 3. Ran execution time tests comparing the before- and after-enhancement version of NDV(). Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 --- M be/src/common/logging.h M be/src/exec/incr-stats-util-test.cc M be/src/exec/incr-stats-util.cc M be/src/exec/incr-stats-util.h M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M tests/query_test/test_aggregation.py 9 files changed, 426 insertions(+), 82 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/40 -- To view, visit http://gerrit.cloudera.org:8080/15997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 Gerrit-Change-Number: 15997 Gerrit-PatchSet: 40 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]>
