[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Qifan Chen (Code Review) Thu, 11 Jun 2020 17:56:24 -0700

Qifan Chen has uploaded a new patch set (#26). ( 
http://gerrit.cloudera.org:8080/15997 )


Change subject: IMPALA-2658: Extend the NDV function to accept a precision
......................................................................

IMPALA-2658: Extend the NDV function to accept a precision

This work addresses the current limitation in NDV function by
extending the function to take a secondary argument. With
the work, NDV can take one of the following two forms:

1. NDV([DISTINCT | ALL] expression)
2. NDV([DISTINCT | ALL] expression, <abstract_value>)

In form 1, all the syntax and semantics are preserved.
The precision, which determines the total number of different
estimators in the HLL algorithm, is still 10.

In form 2, an extra abstract value can be supplied, which
must be an interger value in the range from 1 to 10. This value
is internally mapped to a precision with the following mapping
formula:

  precision = abstract value + 8

Thus, an abstract value of 1 is mapped to a precision of 9
and an abstract value of 10 is mapped a precision of 18.

A large precision value generally produces a better estimation
(i.e., with less error) than a small precision value, due to extra
number of estimators involved. The expense is at the extra amount of
memory needed. For a given precision p, the amount of memory used
by Hll algorithm is in the order of 2^p bytes.

The enhancement involves both the front and the back end.

In the frond end, a 2nd argument in NDV() is allowed and verified.
The following two new error messages can be raised.

1. Second parameter of NDV() must be an integer literal.
2. Second parameter of NDV() must be an integer literal in [1,10].

In the back end, the current hardcoded precision of 10 is removed. The
Hll algorithm now works on any valid precision value specified in
the HDV() function.

Testing:
1. Run unit tests against tpcds.store_sales and tpch.customer
   in both serial and parallel plan settings with the following queries:

   select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) three,
   ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) as six,
   ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) as nine,
   ndv(c_name, 10)  as ten
   from tpch.customer;

   select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) two,
   ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four,
   ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six,
   ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as eight,
   ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten
   from tpcds.store_sales;

2. Add and run a new regression test (test_ndv)) in TestAggregationQueries
   section to computes ndv() for every supported Impala data type over
   all valid abstract values;
3. Run "core" tests.

Performance:
1. Run estimation error tests against a total of 22 different data sets
   loaded into external impala tables:
    - 5 sets with 10 million unique strings
    - 5 sets with 10 million unique integers
    - 5 sets with 100 million unique strings
    - 5 sets with 97 million unique integers
    - 1 set with 499 million unique strings
    - 1 set with 450 million unique integers

   The error is computed as
   abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>.

   Overall, the precision of 18 (or the abstract value of 10) gives
   the best result with worst estimation error at 0.42% (for one set of
   10 million integers), and average error no more than 0.17%,
   at the cost of 256Kb of memory for the internal data structure per
   evaluation of the Hll algorithm.  Other precisions (such as 16 and 17)
   are also very reasonable but with slightly larger estimation errors.

2. Run execution time tests against a total of 6 different data sets
   with different abstract values on a single node EC2 VM. The total
   execution time is relatively stable across different abstract values,
   since the number of files to aggregate from is 1. It remains to be
   seen the execution time for tables involving multiple data files
   across multiple nodes.

   - 10 million unique strings: ~3.5s
   - 10 million unique integers: ~3.34s
   - 100 million unique strings: ~5.0s
   - 97 million unique integers: ~5.0s
   - 499 million unique strings: ~22.0s
   - 450 million unique integers: ~19.0s

Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
9 files changed, 375 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/26
--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 26
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>

[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Reply via email to