fifteencai has uploaded a new patch set (#7). ( 
http://gerrit.cloudera.org:8080/17306 )

Change subject: IMPALA-10445: Adjust NDV's precision/scale by query option
......................................................................

IMPALA-10445: Adjust NDV's precision/scale by query option

We introduce a new way to control NDV's scale.

Since IMPALA-2658, we can trade additional memory for more accurate
estimation by setting a larger `scale`. The scale is chosen by SQL
writers. However, it is hard to roll out scales larger than the default
value of 2 smoothly, for two reasons:
- First, SQL writers are reluctant to lower their expectations; they
tend to write ndv(id, 10) rather than ndv(id, 9), ndv(id, 8) and so on.
But larger scales like 10 use much more memory, especially when
there are `group by`s with high cardinality. So it is wiser to let
the cluster admin choose an appropriate scale instead.
- Second, the queries are stored in a BI tool's configuration DB.
Rewriting thousands of SQL statements is risky.
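
To make the memory concern concrete, here is a rough back-of-the-envelope
sketch. It assumes NDV is backed by HyperLogLog with precision = scale + 8
and one byte per register per group; that mapping is an assumption for
illustration, not something stated in this commit message. The error
estimate is the standard HyperLogLog figure of 1.04 / sqrt(m), and all
function names are hypothetical.

```python
import math

# Assumption: NDV scale s maps to HyperLogLog precision s + 8,
# so scale 2 would use 2^10 registers and scale 10 would use 2^18.
PRECISION_OFFSET = 8


def hll_registers(scale: int) -> int:
    """Number of HLL registers for a given NDV scale (assumed mapping)."""
    if not 1 <= scale <= 10:
        raise ValueError("scale must be in [1, 10]")
    return 1 << (scale + PRECISION_OFFSET)


def sketch_bytes(scale: int, groups: int = 1) -> int:
    """Approximate state size, assuming one byte per register per group."""
    return hll_registers(scale) * groups


def expected_relative_error(scale: int) -> float:
    """Standard HyperLogLog relative standard error: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(hll_registers(scale))


# Compare a few scales for a query that produces one million groups.
for s in (2, 5, 10):
    print(s, sketch_bytes(s, groups=1_000_000), round(expected_relative_error(s), 4))
```

Under these assumptions, scale 10 keeps 256 times as many registers per
group as scale 2, which is why high-cardinality `group by`s are the main
risk.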

In this commit, we introduce a new query option `DEFAULT_NDV_SCALE`.
Because it is a runtime option, we can either set it before query
submission according to the cluster's overall load, or set it as
a default query option for a dynamic resource pool.

We also modified the `Analyzer` to make sure APPX_COUNT_DISTINCT
works with this query option. So if needed, we can degrade service
by transforming `count(distinct id)` to `ndv(id, scale)`.

Implementation details:

- The default value of DEFAULT_NDV_SCALE is 2, so we won't change
the default ndv behavior.
- We moved the `CountDistinctToNdv` transform logic from
`SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied
after `CountDistinctToNdvRule`.
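
The ordering of the two rules can be illustrated with a toy model. This
is only a sketch and not the Java code in this patch: the `Call` class and
both functions are hypothetical, and it assumes that the scale is injected
only into single-argument `ndv(x)` calls when the option differs from the
default, leaving an explicit `ndv(x, n)` untouched.

```python
from dataclasses import dataclass
from typing import Tuple

DEFAULT_SCALE = 2  # default value of DEFAULT_NDV_SCALE per this commit message


@dataclass(frozen=True)
class Call:
    """Toy function-call expression, e.g. count(distinct id) or ndv(id, 5)."""
    name: str
    args: Tuple[str, ...]
    distinct: bool = False


def count_distinct_to_ndv(expr: Call, appx_count_distinct: bool) -> Call:
    """Step 1: count(distinct x) -> ndv(x) when APPX_COUNT_DISTINCT is on."""
    if (appx_count_distinct and expr.name == "count"
            and expr.distinct and len(expr.args) == 1):
        return Call("ndv", expr.args)
    return expr


def default_ndv_scale(expr: Call, default_scale: int) -> Call:
    """Step 2 (assumed behavior): ndv(x) -> ndv(x, scale) for a non-default scale."""
    if expr.name == "ndv" and len(expr.args) == 1 and default_scale != DEFAULT_SCALE:
        return Call("ndv", expr.args + (str(default_scale),))
    return expr


def rewrite(expr: Call, appx_count_distinct: bool, default_scale: int) -> Call:
    """Apply step 1 first so that its output is visible to step 2."""
    return default_ndv_scale(
        count_distinct_to_ndv(expr, appx_count_distinct), default_scale)


print(rewrite(Call("count", ("id",), distinct=True), True, 10))
```

Running the scale rule second is what lets a `count(distinct id)` that was
converted to `ndv(id)` also pick up the configured scale.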

Usage:

To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10;  -- The range is [1, 10]
```

To unset:
```
SET DEFAULT_NDV_SCALE = 2;
```

Here are test results of a typical workload (cardinality=40,090,650):
+====================================================================+
|   Metric    | Count Distinct |    NDV2    |    NDV5    |    NDV10  |
+--------------------------------------------------------------------+
|  Memory(GB) |       3.83     |    1.84    |    1.85    |     1.89  |
| Duration(s) |      182.89    |   30.22    |    29.72   |     29.24 |
|  ErrorRate  |        0%      |    1.8%    |    1.17%   |     0.06% |
+====================================================================+
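
For a quick read of the table, the ratios against the `Count Distinct`
baseline can be computed as below. This is a small illustrative sketch;
the numbers are copied from the table above and the helper name is
hypothetical.

```python
# Figures copied from the table above.
RESULTS = {
    "Count Distinct": {"memory_gb": 3.83, "duration_s": 182.89},
    "NDV2": {"memory_gb": 1.84, "duration_s": 30.22},
    "NDV5": {"memory_gb": 1.85, "duration_s": 29.72},
    "NDV10": {"memory_gb": 1.89, "duration_s": 29.24},
}


def relative_to_baseline(name: str, baseline: str = "Count Distinct"):
    """Return (speedup, memory_ratio) of a variant relative to the baseline."""
    base, row = RESULTS[baseline], RESULTS[name]
    speedup = base["duration_s"] / row["duration_s"]
    memory_ratio = row["memory_gb"] / base["memory_gb"]
    return speedup, memory_ratio


for variant in ("NDV2", "NDV5", "NDV10"):
    speedup, memory_ratio = relative_to_baseline(variant)
    print(f"{variant}: {speedup:.2f}x faster, {memory_ratio:.0%} of baseline memory")
```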

Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit tests; all passed.
4) Added a new query-option test.

Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
A fe/src/main/java/org/apache/impala/rewrite/CountDistinctToNdvRule.java
A fe/src/main/java/org/apache/impala/rewrite/DefaultNdvScaleRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
11 files changed, 229 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17306/7
--
To view, visit http://gerrit.cloudera.org:8080/17306
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Gerrit-Change-Number: 17306
Gerrit-PatchSet: 7
Gerrit-Owner: fifteencai <fifteen...@tencent.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
