Qifan Chen has uploaded a new patch set (#13). (
http://gerrit.cloudera.org:8080/15997 )
Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
......................................................................
[WIP] IMPALA-2658: Extend the NDV function to accept a precision
This change addresses a limitation of the NDV function by
extending it to take an optional second, integer-typed argument,
which must be an abstract value in the range 1 to 10. This
abstract value selects the actual precision that the HLL
algorithm uses for the function.
Front end work:
1. Add a new template ndv function in the builtin db that takes two
arguments;
2. Verify that the 2nd argument of an NDV() call is an integer literal in
[1,10];
3. Add a new method that maps the abstract value to the
hll precision (the real work is TBD);
4. The length of the intermediate data type is computed based on the
actual hll precision. When the 2nd argument is missing, the length
is 1024 as in the current implementation;
5. The 2nd argument, if present, will be carried over all the way to
the BE.
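
The argument check and the length computation in the front end steps above can be sketched as follows. This is only an illustration in Python, not the Java code in the patch: the names validate_ndv_scale and intermediate_len are hypothetical, and the sketch assumes the intermediate length equals 2 ** precision, which is consistent with the default length of 1024 and with the back end recovering the precision as log2(hll_len). The mapping from the abstract value to the actual precision is still TBD, so it is deliberately left out.

```python
from typing import Optional

# Length used when the 2nd argument is missing (value taken from the commit message).
DEFAULT_HLL_LEN = 1024


def validate_ndv_scale(scale: object) -> int:
    """Accept only an integer in [1, 10]. Hypothetical helper name."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(scale, bool) or not isinstance(scale, int):
        raise ValueError("second argument of NDV() must be an integer literal")
    if not 1 <= scale <= 10:
        raise ValueError("second argument of NDV() must be in [1, 10]")
    return scale


def intermediate_len(hll_precision: Optional[int]) -> int:
    """Length of the intermediate type for an actual HLL precision.

    Assumption: length == 2 ** precision. The argument is the actual
    precision, not the abstract value in [1, 10].
    """
    if hll_precision is None:
        return DEFAULT_HLL_LEN
    return 1 << hll_precision
```

With these assumptions, intermediate_len(None) and intermediate_len(10) both give 1024, matching the current hardcoded precision of 10.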
Back end work:
1. Remove the hardcoded precision (10) from these functions:
AggregateFunctions::HllInit(),
AggregateFunctions::HllUpdate(),
AggregateFunctions::HllMerge(),
AggregateFunctions::HllFinalEstimate(),
AggregateFunctions::HllFinalize(),
HllEstimateBias();
2. Instead, the actual precision is computed from the
length of the intermediate data type as log2(hll_len);
3. Verify that the length of the intermediate data type is
correct according to the 2nd argument (if present).
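
As a rough illustration of back end steps 2 and 3, the precision can be recovered from the intermediate length as below. This is a Python sketch, not the C++ in aggregate-functions-ir.cc: the function name hll_precision_from_len is hypothetical, and the power-of-two check is an assumed form of the length verification.

```python
def hll_precision_from_len(hll_len: int) -> int:
    """Return log2(hll_len) for a length that is a positive power of two.

    Hypothetical helper: the patch derives the precision from the length
    of the intermediate data type rather than from a hardcoded constant.
    """
    # A power of two has exactly one bit set, so len & (len - 1) is zero.
    if hll_len <= 0 or hll_len & (hll_len - 1) != 0:
        raise ValueError("intermediate length must be a positive power of two")
    # For a power of two, bit_length() - 1 is the exact base-2 logarithm.
    return hll_len.bit_length() - 1
```

For the default length of 1024 this yields 10, the precision that was previously hardcoded.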
Testing:
1. Add a regression test (test_ndv) in the TestAggregationQueries
section that computes ndv() for every supported Impala data type.
2. Run unit tests against other tables such as tpcds.store_sales
and tpch.customer in both serial and parallel plan settings.
select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) three,
ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) as six,
ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) as nine,
ndv(c_name, 10) as ten
from tpch.customer;
select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) two,
ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four,
ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six,
ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as eight,
ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten
from tpcds.store_sales;
Perf: TBD
Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
6 files changed, 302 insertions(+), 37 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>