fifteencai has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/17306 )
Change subject: IMPALA-10445: Adjust NDV's precision/scale by query option
......................................................................

IMPALA-10445: Adjust NDV's precision/scale by query option

This change introduces a new way to control NDV's scale.

Since IMPALA-2658, users can trade additional memory for a more
accurate estimate by passing a larger `scale` to ndv(). The scale is
chosen by whoever writes the SQL. In practice, however, it is hard to
roll out scales larger than the default value of 2 smoothly, for two
reasons:

- SQL writers are reluctant to lower their expectations: they tend to
  write ndv(id, 10) rather than ndv(id, 9), ndv(id, 8) and so on. But a
  large scale such as 10 uses much more memory, especially when there
  are `group by`s with high cardinality. It is wiser to let the cluster
  admin choose an appropriate scale instead.
- The queries are stored in a BI tool's configuration DB. Rewriting
  thousands of SQL statements is risky.

This commit introduces a new query option, `DEFAULT_NDV_SCALE`. Because
it is a runtime option, it can be set before query submission according
to the cluster's overall load, or configured as a default query option
for a dynamic resource pool.

We also modified the `Analyzer` so that APPX_COUNT_DISTINCT works with
this query option. If needed, service can therefore be degraded by
transforming `count(distinct id)` to `ndv(id, scale)`.

Implementation details:

- The default value of DEFAULT_NDV_SCALE is 2, so the default ndv
  behavior does not change.
- We moved the `CountDistinctToNdv` transform logic from
  `SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
  further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied after
  `CountDistinctToNdvRule`.
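
The ordering of the two rewrite rules described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in this patch: `Call`, `NdvRewriteSketch` and their methods are hypothetical names, expressions are modeled as plain strings rather than Impala `Expr` trees, and the choice to leave `ndv(x)` untouched when the option equals 2 is an assumption made to match the "default behavior does not change" statement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy function call; a stand-in for illustration, NOT Impala's FunctionCallExpr. */
final class Call {
    final String name;
    final boolean distinct;
    final List<String> args;

    Call(String name, boolean distinct, List<String> args) {
        this.name = name;
        this.distinct = distinct;
        this.args = Collections.unmodifiableList(new ArrayList<>(args));
    }

    @Override
    public String toString() {
        return name + "(" + (distinct ? "distinct " : "") + String.join(", ", args) + ")";
    }
}

/** Hypothetical sketch of the two rewrite steps, applied in the order the commit describes. */
final class NdvRewriteSketch {
    // Assumed built-in scale, taken from the commit message (default value of 2).
    static final int BUILTIN_SCALE = 2;

    // Step 1: count(distinct x) -> ndv(x), only when APPX_COUNT_DISTINCT is on.
    static Call countDistinctToNdv(Call c, boolean appxCountDistinct) {
        if (!appxCountDistinct) return c;
        if (!c.name.equals("count") || !c.distinct || c.args.size() != 1) return c;
        return new Call("ndv", false, c.args);
    }

    // Step 2: ndv(x) -> ndv(x, scale). An explicit ndv(x, n) is left alone.
    static Call applyDefaultNdvScale(Call c, int defaultNdvScale) {
        if (defaultNdvScale < 1 || defaultNdvScale > 10) {
            throw new IllegalArgumentException("scale must be in [1, 10]");
        }
        if (!c.name.equals("ndv") || c.args.size() != 1) return c;
        // Assumption: with the option at its default, the call is not rewritten.
        if (defaultNdvScale == BUILTIN_SCALE) return c;
        List<String> args = new ArrayList<>(c.args);
        args.add(Integer.toString(defaultNdvScale));
        return new Call("ndv", false, args);
    }

    // Step 2 runs after step 1, so a converted count(distinct x) also picks up the scale.
    static Call rewrite(Call c, boolean appxCountDistinct, int defaultNdvScale) {
        return applyDefaultNdvScale(countDistinctToNdv(c, appxCountDistinct), defaultNdvScale);
    }

    public static void main(String[] unused) {
        Call q = new Call("count", true, Arrays.asList("id"));
        System.out.println(q + " => " + rewrite(q, true, 10));
    }
}
```

The point of the ordering is that running the scale step second lets it see the `ndv(id)` produced by the first step; in the reverse order a converted `count(distinct id)` would never receive the configured scale.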

Usage:

To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10; -- The range is [1, 10]
```

To restore the default:
```
SET DEFAULT_NDV_SCALE = 2;
```

Here are test results for a typical workload (cardinality=40,090,650):

+==============+================+=======+=======+=======+
| Metric       | Count Distinct | NDV2  | NDV5  | NDV10 |
+--------------+----------------+-------+-------+-------+
| Memory (GB)  | 3.83           | 1.84  | 1.85  | 1.89  |
| Duration (s) | 182.89         | 30.22 | 29.72 | 29.24 |
| ErrorRate    | 0%             | 1.8%  | 1.17% | 0.06% |
+==============+================+=======+=======+=======+

Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit tests; all passed.
4) Added a new query-option test.

Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
---
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
A fe/src/main/java/org/apache/impala/rewrite/CountDistinctToNdvRule.java
A fe/src/main/java/org/apache/impala/rewrite/DefaultNdvScaleRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
11 files changed, 229 insertions(+), 34 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17306/7

--
To view, visit http://gerrit.cloudera.org:8080/17306
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Gerrit-Change-Number: 17306
Gerrit-PatchSet: 7
Gerrit-Owner: fifteencai <fifteen...@tencent.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>