[
https://issues.apache.org/jira/browse/IMPALA-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327113#comment-17327113
]
ASF subversion and git services commented on IMPALA-10445:
----------------------------------------------------------
Commit 1fb7dbac0d43f3ccbbbbaaf9c41db10d3320fc48 in impala's branch
refs/heads/master from fifteencai
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1fb7dba ]
IMPALA-10445: Adjust NDV's scale with query option
This is a new way to control NDV's scale.
Since IMPALA-2658, we can trade memory for more accurate
estimation by setting larger `scale` in SQL function
NDV(<expr>, <scale>). However the use of larger NDV scale requires
the modification of SQL queries which may not be practical in certain
applications:
- Firstly, SQL writers are reluctant to lower that scale. They prone
to fill up the scale, which will make the cluster unstable. Especially
when there are `group by`s with high cardinalities. So it is wiser to
let cluster admin other than sql writer choose appropriate scale.
- Secondly, In some application scenarios, queries are stored in DBs.
In a BI system, for example, rewriting thousands of SQLs is risky.
In this commit, we introduced a new Query Option `DEFAULT_NDV_SCALE`
with the following semantics:
1. The allowed value is in the range [1..10];
2. Previously, the scale used in NDV(<expr>) functions was fixed at 2.
Now the scale is provided by the newly added query options.
3. It does not influence the NDV scale for SQL function
NDV(<expr>, <scale>) in which the NDV scale is provided by the 2nd
argument <scale>.
We also refactored method `Analyze` to make sure APPX_COUNT_DISTINCT
can work with this query option. After this, cluster admins can
substitute `count(distinct <expr>)` with `ndv(<expr>, scale)`.
Implementation details:
- The default value of DEFAULT_NDV_SCALE is 2, so we won't change
the default ndv behavior.
- We port `CountDistinctToNdv` transform logic from
`SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied
after `CountDistinctToNdvRule`.
Usage:
To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10;
```
To unset:
```
SET DEFAULT_NDV_SCALE = 2;
```
Here are test results of a typical workload (cardinality=40,090,650):
+====================================================================+
| Metric | Count Distinct | NDV2 | NDV5 | NDV10 |
+--------------------------------------------------------------------+
| Memory(GB) | 3.83 | 1.84 | 1.85 | 1.89 |
| Duration(s) | 182.89 | 30.22 | 29.72 | 29.24 |
| ErrorRate | 0% | 1.8% | 1.17% | 0.06% |
+====================================================================+
Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit test, passed.
4) Added a new query-option test.
Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Reviewed-on: http://gerrit.cloudera.org:8080/17306
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> The ability to adjust NDV's precision with query option
> -------------------------------------------------------
>
> Key: IMPALA-10445
> URL: https://issues.apache.org/jira/browse/IMPALA-10445
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.0
> Reporter: Fifteen
> Assignee: Fifteen
> Priority: Minor
>
> Since IMPALA-2658, we can trade memory for more accurate NDV estimation. It
> is fascinating because tests showing error rate within 0.1% while no
> tremendous resource usage rise is found( #registers is 2 << 18). Users may
> have less complaint on computation precision in the future.
> However, the road to apply high precision NDV to production environment is
> uneven.
> 1) We have to re-write sqls for a large number of historical workloads. Which
> is time costing and is prone to error.
> 2) Cluster users, aka sql writers, are reluctant to lower their expectations.
> It would be more convenient to have a way for cluster admins to adjust
> precision for each Admission Control queue according to cluster's resource
> usage(rough world).
> Propose:
> Add a new query option DEFAULT_NDV_SCALE to change the default precision
> setting for NDV()
> Implementation:
> # Add a query option in FE
> # If the option is set, use the matching NDV(<expr>, <scale>) function
> instead of NDV().
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]