[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Qifan Chen (Code Review) Sun, 14 Jun 2020 10:20:15 -0700

Qifan Chen has uploaded a new patch set (#33). ( 
http://gerrit.cloudera.org:8080/15997 )


Change subject: IMPALA-2658: Extend the NDV function to accept a precision
......................................................................

IMPALA-2658: Extend the NDV function to accept a precision

This work addresses the current limitation in NDV function by
extending the function to optionally take a secondary argument
called scale.

   NDV([DISTINCT | ALL] expression [, scale])

Without the secondary argument, all the syntax and semantics are
preserved.  The precision, which determines the total number
of different estimators in the HLL algorithm, is still 10.

When supplied, the scale argument must be an interger literal
in the range from 1 to 10. Its value is internally mapped
to a precision used by the HLL algorithm, with the following
mapping formula:

  precision = scale + 8.

Thus, a scale of 1 is mapped to a precision of 9 and a scale of
10 is mapped to a precision of 18.

A large precision value generally produces a better estimation
(i.e. with less error) than a small precision value, due to extra
number of estimators involved. The expense is at the extra amount of
memory needed. For a given precision p, the amount of memory used
by the HLL algorithm is in the order of 2^p bytes.

The enhancement involves both the front and the back end.

In the frond end, a 2nd parameter in NDV() is allowed and verified.
In addition, the data type of the intermediate result in the
plan records the correct amount of memory needed. This is assisted
by the inclusion of additional template aggregate function objects
in the built-in database.

In the back end, the current hardcoded precision of 10 is removed. The
HLL algorithm now works with the default, or any valid precision values.
The precision value is computed from the corresponding scale value
stored in the query plan.

Testing:
1. Ran unit tests against tpcds.store_sales and tpch.customer in both
   serial and parallel plan settings;
2. Added and ran a new regression test (test_ndv)) in TestAggregationQueries
   section to compute NDV() for every supported Impala data type over
   all valid scale values;
3. Ran "core" tests.

Performance:
1. Ran estimation error tests against a total of 22 distinct data sets
   loaded into external Impala tables:
    - 5 sets with 10 million unique strings
    - 5 sets with 10 million unique integers
    - 5 sets with 100 million unique strings
    - 5 sets with 97 million unique integers
    - 1 set with 499 million unique strings
    - 1 set with 450 million unique integers

   The error was computed as
   abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>.

   Overall, the precision of 18 (or the scale value of 10) gave
   the best result with worst estimation error at 0.42% (for one set of
   10 million integers), and average error no more than 0.17%,
   at the cost of 256Kb of memory for the internal data structure per
   evaluation of the HLL algorithm.  Other precisions (such as 16 and 17)
   were also very reasonable but with slightly larger estimation errors.

2. Ran execution time tests against a total of 6 distinct data files on a
   single node EC2 VM in debug mode. These data files were loaded in turn
   into a single column in an external Impala table.  It was found that the
   total execution time was relatively the same across different scales for
   a given table configuration. It remains to be seen the execution time for
   tables involving multiple data files across multiple nodes.

   - 10 million unique string file: ~3.5s
   - 10 million unique integer file: ~3.34s
   - 100 million unique string file: ~5.0s
   - 97 million unique integer file: ~5.0s
   - 499 million unique string file: ~22.0s
   - 450 million unique integer file: ~19.0s

Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
9 files changed, 429 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/33
--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 33
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>

[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Reply via email to