[
https://issues.apache.org/jira/browse/IMPALA-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145909#comment-17145909
]
ASF subversion and git services commented on IMPALA-2658:
---------------------------------------------------------
Commit eef61d22d89b97eb589936701a41d05d84b0da8a in impala's branch
refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=eef61d2 ]
IMPALA-2658: Extend the NDV function to accept a precision
This work addresses a current limitation of the NDV function by
extending it to optionally take a second argument called scale.
NDV([DISTINCT | ALL] expression [, scale])
Without the second argument, all existing syntax and semantics are
preserved: the precision, which determines the total number of
estimators in the HLL algorithm, remains 10.
When supplied, the scale argument must be an integer literal
in the range from 1 to 10. Its value is internally mapped
to a precision used by the HLL algorithm, with the following
mapping formula:
precision = scale + 8.
Thus, a scale of 1 is mapped to a precision of 9 and a scale of
10 is mapped to a precision of 18.
A larger precision value generally produces a better estimate
(i.e. with less error) than a smaller one, due to the extra
number of estimators involved. The expense is the extra memory
needed: for a given precision p, the amount of memory used
by the HLL algorithm is on the order of 2^p bytes.
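The scale-to-precision mapping and the memory cost described above can be
sketched in a few lines. This is an illustrative helper, not Impala source
code; the function name is hypothetical:

```python
def hll_cost(scale: int) -> tuple[int, int]:
    """Return (precision, approx_bytes) for a given NDV scale (1..10).

    Applies the mapping precision = scale + 8 from the commit message,
    and the 2^p-byte memory estimate for the HLL data structure.
    """
    if not 1 <= scale <= 10:
        raise ValueError("scale must be an integer literal between 1 and 10")
    precision = scale + 8          # mapping formula: precision = scale + 8
    approx_bytes = 2 ** precision  # HLL memory is on the order of 2^p bytes
    return precision, approx_bytes

# scale 10 -> precision 18 -> 262144 bytes (256 KB), matching the
# memory figure reported in the performance tests below.
```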
Testing:
1. Ran unit tests against table store_sales in TPC-DS and table customer
in TPCH in both serial and parallel plan settings;
2. Added and ran a new regression test (test_ndv) in the
TestAggregationQueries section to compute NDV() for every supported
Impala data type over all valid scale values;
3. Ran "core" tests.
Performance:
1. Ran estimation error tests against a total of 22 distinct data sets
loaded into external Impala tables.
The error was computed as
abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>.
Overall, a precision of 18 (i.e. a scale of 10) gave the best
result, with a worst-case estimation error of 0.42% (for one set
of 10 million integers) and an average error of no more than 0.17%,
at the cost of 256 KB of memory for the internal data structure per
evaluation of the HLL algorithm. Other precisions (such as 16 and
17) were also very reasonable but had slightly larger estimation
errors.
2. Ran execution time tests against a total of 6 distinct data files
on a single-node EC2 VM in debug mode. Each data file was loaded
in turn into a single column of an external Impala table. The total
execution time was found to be roughly the same across different
scales for a given table configuration. The execution time for
tables involving multiple data files across multiple nodes remains
to be measured.
3. Ran execution time tests comparing the before- and
after-enhancement versions of NDV().
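The error metric used in the estimation tests above is a plain relative
error; a minimal sketch (hypothetical function name, not from the Impala
test harness):

```python
def relative_error(true_ndv: int, estimated_ndv: float) -> float:
    """Relative estimation error:
    abs(true - estimated) / true, as defined in the performance notes."""
    return abs(true_ndv - estimated_ndv) / true_ndv

# A true cardinality of 10,000,000 estimated as 10,042,000 yields
# an error of 0.0042, i.e. the 0.42% worst case reported above.
```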
Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Reviewed-on: http://gerrit.cloudera.org:8080/15997
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Extend the NDV function to accept a precision
> ---------------------------------------------
>
> Key: IMPALA-2658
> URL: https://issues.apache.org/jira/browse/IMPALA-2658
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.2.4
> Reporter: Peter Ebert
> Assignee: Qifan Chen
> Priority: Minor
> Labels: ramp-up
> Attachments: Comparison of HLL Memory usage, Query Duration and
> Accuracy.jpg
>
>
> Hyperloglog algorithm used by NDV defaults to a precision of 10. Being able
> to set this precision would have two benefits:
> # Lower precision sizes can speed up performance, as a precision of 9 has
> half the number of registers of 10 (exponential) and may be just as accurate
> depending on the expected cardinality.
> # Higher precision can help with very large cardinalities (100 million to
> billion range) and will typically provide more accurate data. Those who are
> presenting estimates to end users will likely be willing to trade some
> performance cost for more accuracy, while still outperforming the naive
> approach by a large margin.
> Propose adding the overloaded function NDV(expression, int precision)
> with an accepted range of 4 to 18 inclusive.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)