[ 
https://issues.apache.org/jira/browse/IMPALA-10445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333256#comment-17333256
 ] 

ASF subversion and git services commented on IMPALA-10445:
----------------------------------------------------------

Commit 1fb7dbac0d43f3ccbbbbaaf9c41db10d3320fc48 in impala's branch 
refs/heads/dependabot/pip/infra/python/deps/py-1.10.0 from fifteencai
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1fb7dba ]

IMPALA-10445: Adjust NDV's scale with query option

This is a new way to control NDV's scale.

Since IMPALA-2658, we can trade memory for more accurate
estimation by setting larger `scale` in SQL function
NDV(<expr>, <scale>). However the use of larger NDV scale requires
the modification of SQL queries which may not be practical in certain
applications:

- Firstly, SQL writers are reluctant to lower that scale. They prone
to fill up the scale, which will make the cluster unstable. Especially
when there are `group by`s with high cardinalities. So it is wiser to
let cluster admin other than sql writer choose appropriate scale.

- Secondly, In some application scenarios, queries are stored in DBs.
In a BI system, for example, rewriting thousands of SQLs is risky.

In this commit, we introduced a new Query Option `DEFAULT_NDV_SCALE`
with the following semantics:

1. The allowed value is in the range [1..10];
2. Previously, the scale used in NDV(<expr>) functions was fixed at 2.
Now the scale is provided by the newly added query options.
3. It does not influence the NDV scale for SQL function
NDV(<expr>, <scale>) in which the NDV scale is provided by the 2nd
argument <scale>.

We also refactored method `Analyze` to make sure APPX_COUNT_DISTINCT
can work with this query option. After this, cluster admins can
substitute `count(distinct <expr>)` with `ndv(<expr>, scale)`.

Implementation details:

- The default value of DEFAULT_NDV_SCALE is 2, so we won't change
the default ndv behavior.
- We port `CountDistinctToNdv` transform logic from
`SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied
after `CountDistinctToNdvRule`.

Usage:

To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10;
```

To unset:
```
SET DEFAULT_NDV_SCALE = 2;
```

Here are test results of a typical workload (cardinality=40,090,650):
+====================================================================+
|   Metric    | Count Distinct |    NDV2    |    NDV5    |    NDV10  |
+--------------------------------------------------------------------+
|  Memory(GB) |       3.83     |    1.84    |    1.85    |     1.89  |
| Duration(s) |      182.89    |   30.22    |    29.72   |     29.24 |
|  ErrorRate  |        0%      |    1.8%    |    1.17%   |     0.06% |
+====================================================================+

Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit test, passed.
4) Added a new query-option test.

Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Reviewed-on: http://gerrit.cloudera.org:8080/17306
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> The ability to adjust NDV's precision with query option
> -------------------------------------------------------
>
>                 Key: IMPALA-10445
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10445
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Fifteen
>            Assignee: Fifteen
>            Priority: Minor
>             Fix For: Impala 4.0
>
>
> Since IMPALA-2658, we can trade memory for more accurate NDV estimation. It 
> is fascinating because tests showing error rate within 0.1% while no 
> tremendous resource usage rise is found( #registers is 2 << 18). Users may 
> have less complaint on computation precision in the future.
> However, the road to apply high precision NDV to production environment is 
> uneven. 
> 1) We have to re-write sqls for a large number of historical workloads. Which 
> is time costing and is prone to error.
> 2) Cluster users, aka sql writers, are reluctant to lower their expectations. 
> It would be more convenient to have a way for cluster admins to adjust 
> precision for each Admission Control queue according to cluster's resource 
> usage(rough world).
> Propose:
> Add a new query option DEFAULT_NDV_SCALE to change the  default precision 
> setting for NDV() 
> Implementation:
>  # Add a query option in FE
>  # If the option is set, use the matching NDV(<expr>, <scale>) function 
> instead of NDV(). 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to