[
https://issues.apache.org/jira/browse/IMPALA-12454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766930#comment-17766930
]
Riza Suminto commented on IMPALA-12454:
---------------------------------------
I tried hacking this myself and found that using exponential backoff causes
cardinality overestimation in low-scale TPC-DS, and even changes the query plan
shape for some queries. These planner tests changed:
{code:java}
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q13.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q41.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q47.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q48.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q53.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q57.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q63.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q85.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q89.test
modified: testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q91.test
{code}
It is probably better to solve this with column correlation or histogram stats
in the future.
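
For reference, the two estimators being compared can be sketched roughly as
follows. This is a minimal illustration, not the Impala code: the class and
method names (SelectivitySketch, product, backoff) are made up here, and the
backoff shape (sort ascending, raise the i-th selectivity to 1/(i+1)) is my
assumption of what PlanNode.computeCombinedSelectivity() does, so check the
linked source before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Impala.
final class SelectivitySketch {
  // Independence assumption: multiply all child selectivities together.
  static double product(List<Double> selectivities) {
    double result = 1.0;
    for (double s : selectivities) result *= s;
    return result;
  }

  // Exponential backoff (assumed shape): sort ascending so the most selective
  // child is applied in full, then damp the i-th child (0-based) with the
  // exponent 1 / (i + 1).
  static double backoff(List<Double> selectivities) {
    List<Double> sorted = new ArrayList<>(selectivities);
    Collections.sort(sorted);
    double result = 1.0;
    for (int i = 0; i < sorted.size(); i++) {
      result *= Math.pow(sorted.get(i), 1.0 / (i + 1));
    }
    // Keep the result inside [0, 1].
    return Math.max(0.0, Math.min(1.0, result));
  }

  public static void main(String[] args) {
    // Made-up child selectivities, not taken from any TPC-DS plan.
    List<Double> sels = Arrays.asList(0.1, 0.2, 0.5);
    System.out.println("product = " + product(sels));
    System.out.println("backoff = " + backoff(sels));
  }
}
```

Because every child after the first is damped, the backoff result is never
smaller than the plain product, which is consistent with the higher
cardinality estimates seen in the changed planner tests.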
> CompoundPredicate with AND operator can result in very low selectivity.
> -----------------------------------------------------------------------
>
> Key: IMPALA-12454
> URL: https://issues.apache.org/jira/browse/IMPALA-12454
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.2.0
> Reporter: Riza Suminto
> Priority: Major
>
> CompoundPredicate with the AND operator estimates its selectivity by simply
> multiplying the selectivities of its child expressions.
> [https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java#L174-L176]
>
>
> This can lead to a very low number, as happens in TPC-DS Q53.
> {code:java}
> | F01:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
> | Per-Instance Resources: mem-estimate=24.30MB mem-reservation=1.00MB thread-reservation=1
> | 00:SCAN S3 [tpcds_3000_string_parquet_managed.item, RANDOM]
> | S3 partitions=1/1 files=4 size=33.54MB
> | predicates: ((i_category IN ('Books', 'Children', 'Electronics') AND i_class IN ('personal', 'portable', 'reference', 'self-help') AND i_brand IN ('scholaramalgamalg #14', 'scholaramalgamalg #7', 'exportiunivamalg #9', 'scholaramalgamalg #9')) OR (i_category IN ('Women', 'Music', 'Men') AND i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND i_brand IN ('amalgimporto #1', 'edu packscholar #1', 'exportiimporto #1', 'importoamalg #1')))
> | stored statistics:
> | table: rows=360.00K size=33.54MB
> | columns: all
> | extrapolated-rows=disabled max-scan-range-rows=117.77K
> | mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=0
> | tuple-ids=0 row-size=74B cardinality=51
> | in pipelines: 00(GETNEXT) {code}
> The CompoundPredicate in 00:SCAN is estimated to have very low selectivity,
> reducing 360K rows to just 51, while in reality the scan returns 18.53K rows.
> {code:java}
> | 00:SCAN S3 4 4 18.000ms 24.000ms 18.53K 51 2.31 MB 24.00 MB tpcds_3000_string_parquet_managed.item {code}
> Selectivity estimation for this CompoundPredicate case should use an exponential
> backoff algorithm similar to the one in PlanNode.computeCombinedSelectivity().
> [https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L730-L733]
>
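
To put the Q53 numbers above in perspective, here is a small worked
calculation of the implied selectivities. It only uses the figures quoted in
the description (360K table rows, 51 estimated rows, 18.53K actual rows); the
class name Q53EstimateSketch and its methods are hypothetical and exist only
for this illustration.

```java
// Hypothetical helper for illustration only; not part of Impala.
final class Q53EstimateSketch {
  // Fraction of input rows that pass the predicate.
  static double selectivity(double outputRows, double tableRows) {
    return outputRows / tableRows;
  }

  // How many times larger the actual row count is than the estimate.
  static double underestimateFactor(double actualRows, double estimatedRows) {
    return actualRows / estimatedRows;
  }

  public static void main(String[] args) {
    double tableRows = 360_000;   // "table: rows=360.00K" in the plan
    double estimatedRows = 51;    // "cardinality=51" in the plan
    double actualRows = 18_530;   // "18.53K" in the exec summary

    System.out.println("estimated selectivity = " + selectivity(estimatedRows, tableRows));
    System.out.println("actual selectivity    = " + selectivity(actualRows, tableRows));
    System.out.println("underestimate factor  = " + underestimateFactor(actualRows, estimatedRows));
  }
}
```

The estimated selectivity works out to roughly 0.00014 against an observed
value of roughly 0.051, so the scan cardinality is underestimated by a factor
of several hundred.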