fifteencai has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17306 )


Change subject: IMPALA-10445: Adjust NDV's precision/scale by query option
......................................................................

IMPALA-10445: Adjust NDV's precision/scale by query option

This commit introduces a new query option that controls the default
HyperLogLog scale used by NDV.

Since IMPALA-2658, NDV can trade additional memory for a more accurate
estimate by using a larger `scale`. The scale is chosen by whoever writes
the SQL, and in practice that makes it hard to move existing workloads to
scales above the default of 2, for two reasons. First, SQL writers are
reluctant to accept lower accuracy: they tend to write ndv(id, 10) rather
than ndv(id, 9), ndv(id, 8) and so on. But a large scale such as 10 uses
much more memory, especially with a high-cardinality `group by`, so it is
wiser to let the cluster admin choose an appropriate scale. Second, the
queries are stored in a BI tool's configuration DB, and rewriting
thousands of SQL statements is risky.
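
To make the memory/accuracy trade-off above concrete, here is a rough
back-of-the-envelope sketch in Python. It assumes that scale s maps to
HyperLogLog precision s + 8 (so the default scale 2 uses 2^10 registers),
that each register takes one byte, and that the textbook HyperLogLog
standard-error estimate 1.04 / sqrt(m) applies. The function names are
illustrative only and are not part of Impala.

```python
import math

# Assumption: NDV scale s in [1, 10] maps to HyperLogLog precision s + 8.
PRECISION_OFFSET = 8


def hll_registers(scale: int) -> int:
    """Number of HLL registers (m = 2^precision) for a given NDV scale."""
    if not 1 <= scale <= 10:
        raise ValueError("scale must be in [1, 10]")
    return 1 << (scale + PRECISION_OFFSET)


def hll_std_error(scale: int) -> float:
    """Textbook HLL relative standard error, 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(hll_registers(scale))


def sketch_bytes(scale: int, groups: int = 1) -> int:
    """Rough memory estimate: one byte per register, one sketch per group."""
    return hll_registers(scale) * groups


if __name__ == "__main__":
    for scale in (2, 5, 10):
        print(scale, hll_registers(scale), f"{hll_std_error(scale):.4%}",
              sketch_bytes(scale, groups=1_000_000))
```

Under these assumptions a single sketch grows from 1 KiB at scale 2 to
256 KiB at scale 10, which is why a high-cardinality `group by` is where a
large scale costs the most.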

In this commit, we introduced a new Query Option `DEFAULT_NDV_SCALE`.
Because it is a runtime option, we can either set it before submitting a
query, according to the cluster's overall load, or set it as a default
query option of a dynamic resource pool.

We also modified the `Analyzer` to make sure APPX_COUNT_DISTINCT works
with this query option. So, if needed, we can degrade the service by
transforming `count(distinct id)` into `ndv(id, scale)`.

Implementation details:

- The default value of DEFAULT_NDV_SCALE is 2, so the default ndv behavior
  is unchanged.
- We ported the `CountDistinctToNdv` transform logic from
  `SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
  further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied after
  `CountDistinctToNdvRule`.
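
The rule ordering described above can be illustrated with a toy model in
Python. This is only a sketch and not the Java code in this change:
`FnCall`, the option keys and both rule functions are hypothetical
stand-ins, and it assumes that the scale rule only fills in a scale when
`ndv` is called with a single argument, leaving an explicit scale alone.

```python
from dataclasses import dataclass
from typing import Tuple

DEFAULT_SCALE = 2  # default value of DEFAULT_NDV_SCALE, as stated above


@dataclass(frozen=True)
class FnCall:
    """Toy aggregate call; stands in for a real function-call expression."""
    name: str
    args: Tuple[object, ...]
    distinct: bool = False


def count_distinct_to_ndv(expr: FnCall, opts: dict) -> FnCall:
    """count(distinct x) -> ndv(x), only when APPX_COUNT_DISTINCT is on."""
    if (opts.get("appx_count_distinct") and expr.name == "count"
            and expr.distinct and len(expr.args) == 1):
        return FnCall("ndv", expr.args)
    return expr


def default_ndv_scale(expr: FnCall, opts: dict) -> FnCall:
    """ndv(x) -> ndv(x, scale). Assumption: an explicit scale is left alone."""
    scale = opts.get("default_ndv_scale", DEFAULT_SCALE)
    if expr.name == "ndv" and len(expr.args) == 1 and scale != DEFAULT_SCALE:
        return FnCall("ndv", expr.args + (scale,))
    return expr


# Order matters: the scale rule must see the ndv() produced by the first rule.
RULES = [count_distinct_to_ndv, default_ndv_scale]


def rewrite(expr: FnCall, opts: dict) -> FnCall:
    """Apply each rule once, in order."""
    for rule in RULES:
        expr = rule(expr, opts)
    return expr


if __name__ == "__main__":
    opts = {"appx_count_distinct": True, "default_ndv_scale": 10}
    print(rewrite(FnCall("count", ("id",), distinct=True), opts))
```

If the two rules ran in the opposite order, the `ndv(id)` produced from
`count(distinct id)` would never pick up the configured scale in this
model.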

Usage:

To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10;  -- The range is [1, 10]
```

To unset:
```
SET DEFAULT_NDV_SCALE = 2;
```

Test results for a typical workload (cardinality = 40,090,650):
+=====================================================================+
|   Metric    | Count Distinct |    NDV2    |    NDV5    |    NDV10   |
+---------------------------------------------------------------------+
|  Memory(GB) |       3.83     |    1.84    |    1.85    |     1.89   |
| Duration(s) |      182.89    |   30.22    |    29.72   |     29.24  |
|  ErrorRate  |        0%      |    1.8%    |    1.17%   |     0.06%  |
+=====================================================================+
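
As a quick reading aid for the table above, the following Python sketch
derives the speedup and memory saving of each NDV run relative to the
exact `count(distinct ...)` run. The numbers are copied from the table;
the dictionary layout and the helper name are illustrative only.

```python
# Figures copied from the table above.
RESULTS = {
    "Count Distinct": {"memory_gb": 3.83, "duration_s": 182.89},
    "NDV2": {"memory_gb": 1.84, "duration_s": 30.22},
    "NDV5": {"memory_gb": 1.85, "duration_s": 29.72},
    "NDV10": {"memory_gb": 1.89, "duration_s": 29.24},
}


def relative_to_exact(name: str) -> tuple:
    """Return (speedup, fraction of memory saved) versus Count Distinct."""
    base = RESULTS["Count Distinct"]
    row = RESULTS[name]
    speedup = base["duration_s"] / row["duration_s"]
    memory_saved = 1.0 - row["memory_gb"] / base["memory_gb"]
    return speedup, memory_saved


if __name__ == "__main__":
    for name in ("NDV2", "NDV5", "NDV10"):
        speedup, saved = relative_to_exact(name)
        print(f"{name}: {speedup:.2f}x faster, {saved:.1%} less memory")
```

For this workload every NDV run is roughly 6x faster than the exact count
while using about half the memory, and going from scale 2 to scale 10 adds
only 0.05 GB.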

Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit tests; all passed.
4) Added a new query-option test.

Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
A fe/src/main/java/org/apache/impala/rewrite/CountDistinctToNdvRule.java
A fe/src/main/java/org/apache/impala/rewrite/DefaultNdvScaleRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
11 files changed, 190 insertions(+), 34 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17306/1
--
To view, visit http://gerrit.cloudera.org:8080/17306
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Gerrit-Change-Number: 17306
Gerrit-PatchSet: 1
Gerrit-Owner: fifteencai <fifteen...@tencent.com>
