[ 
https://issues.apache.org/jira/browse/IMPALA-12454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766930#comment-17766930
 ] 

Riza Suminto commented on IMPALA-12454:
---------------------------------------

I tried prototyping this myself and found that using exponential backoff causes 
cardinality overestimation in low-scale TPC-DS, and even changes the query plan 
shape for some queries. The following planner tests changed:
{code:java}
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q13.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q41.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q47.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q48.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q53.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q57.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q63.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q85.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q89.test
        modified:   testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q91.test
{code}
It is probably better to solve this with column correlation or histogram stats 
in the future.

> CompoundPredicate with AND operator can result in very low selectivity.
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-12454
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12454
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Riza Suminto
>            Priority: Major
>
> A CompoundPredicate with the AND operator estimates its selectivity by simply 
> multiplying the selectivities of its child expressions.
> [https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java#L174-L176]
>  
>  
> This can lead to a very low estimate, as happens in TPC-DS Q53.
> {code:java}
> |  F01:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
> |  Per-Instance Resources: mem-estimate=24.30MB mem-reservation=1.00MB thread-reservation=1
> |  00:SCAN S3 [tpcds_3000_string_parquet_managed.item, RANDOM]
> |     S3 partitions=1/1 files=4 size=33.54MB
> |     predicates: ((i_category IN ('Books', 'Children', 'Electronics') AND i_class IN ('personal', 'portable', 'reference', 'self-help') AND i_brand IN ('scholaramalgamalg #14', 'scholaramalgamalg #7', 'exportiunivamalg #9', 'scholaramalgamalg #9')) OR (i_category IN ('Women', 'Music', 'Men') AND i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND i_brand IN ('amalgimporto #1', 'edu packscholar #1', 'exportiimporto #1', 'importoamalg #1')))
> |     stored statistics:
> |       table: rows=360.00K size=33.54MB
> |       columns: all
> |     extrapolated-rows=disabled max-scan-range-rows=117.77K
> |     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=0
> |     tuple-ids=0 row-size=74B cardinality=51
> |     in pipelines: 00(GETNEXT) {code}
> The CompoundPredicate in 00:SCAN is estimated with a very low selectivity, 
> reducing 360K rows to just 51. In reality, the scan returns 18.53K rows.
> {code:java}
> |  00:SCAN S3                 4      4   18.000ms   24.000ms   18.53K          51    2.31 MB       24.00 MB  tpcds_3000_string_parquet_managed.item {code}
> Selectivity estimation for this CompoundPredicate case should use an 
> exponential backoff algorithm similar to the one in 
> PlanNode.computeCombinedSelectivity().
> [https://github.com/apache/impala/blob/3614a6a776819a1e918ce7fe833cd9e916d6002a/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L730-L733]
>  
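
To make the difference between the two strategies in the issue description concrete, here is a minimal sketch in Java. It is not the Impala code: the class name SelectivitySketch, both method names, and the input selectivities are hypothetical, and the backoff form used here (sort ascending, then raise the i-th selectivity to the power 1/i) is an assumption about the general technique, so check the linked PlanNode source for the exact behavior.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from the Impala code base.
final class SelectivitySketch {

  // Independence assumption: multiply all child selectivities together.
  static double product(double... selectivities) {
    double result = 1.0;
    for (double s : selectivities) {
      result *= s;
    }
    return result;
  }

  // Exponential backoff (assumed form): s1 * s2^(1/2) * ... * sn^(1/n),
  // where s1 <= s2 <= ... <= sn, so the most selective child is applied
  // in full and each later child has a progressively weaker effect.
  static double exponentialBackoff(double... selectivities) {
    double[] sorted = selectivities.clone();
    Arrays.sort(sorted);  // ascending, which also makes the result order-independent
    double result = 1.0;
    for (int i = 0; i < sorted.length; i++) {
      result *= Math.pow(sorted[i], 1.0 / (i + 1));
    }
    // Clamp to the valid selectivity range [0, 1].
    return Math.max(0.0, Math.min(1.0, result));
  }

  public static void main(String[] args) {
    // Made-up child selectivities for three AND-ed predicates.
    double[] children = {0.25, 0.25, 0.25};
    System.out.println("product: " + product(children));
    System.out.println("backoff: " + exponentialBackoff(children));
  }
}
```

For selectivities in [0, 1], each backoff factor s^(1/k) is at least s, so the backoff estimate is never lower than the plain product. That is consistent with the comment above that switching to backoff raised cardinality estimates in the low-scale TPC-DS planner tests.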


