[ 
https://issues.apache.org/jira/browse/IMPALA-11787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy resolved IMPALA-11787.
----------------------------------------
    Fix Version/s: Impala 4.3.0
       Resolution: Fixed

> Cardinality estimate for UNION in Iceberg position-delete plans can double 
> the actual table cardinality
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11787
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11787
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: David Rorke
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.3.0
>
>         Attachments: q78_iceberg_update_profile.txt.gz
>
>
> The plan for Iceberg tables with position-delete files includes a UNION 
> operator that takes the following inputs:
>  * LHS: Scan of the data files that don't have corresponding delete files
>  * RHS: ANTI JOIN that filters the data files that do have corresponding 
> delete files based on the content of the delete files.
> The planner's cardinality estimates for each of these two inputs to the UNION 
> can be as large as the full row count of the table (assuming no other 
> predicates in the scan) and the planner simply sums these in the UNION which 
> can result in a cardinality estimate for the UNION that's twice the size of 
> the table.  For example here's an excerpt from the TPC-DS query 78 summary 
> (the total row count for the full store_sales table is 8.64B and the planner 
> estimate of the union cardinality is twice that):
>  
> {noformat}
> Operator                              #Hosts  #Inst   Avg Time   Max Time    
> #Rows  Est. #Rows    Peak Mem  Est. Peak Mem  Detail
> ...
> |  |  04:UNION                            10    120    9.168ms   17.416ms    
> 1.66B      17.28B           0              0
> |  |  |--02:DELETE EVENTS HASH JOIN       10    120  138.821us  615.342us     
>    0       8.64B    42.12 KB              0  LEFT ANTI JOIN, BROADCAST
> |  |  |  |--F22:JOIN BUILD                10     10    5s741ms    5s944ms     
>                      3.09 GB        3.07 GB
> |  |  |  |  35:EXCHANGE                   10     10  212.271ms  264.374ms   
> 12.84M      12.84M    14.98 MB       36.84 MB  BROADCAST
> |  |  |  |  F12:EXCHANGE SENDER           10    120   30.579ms   40.774ms     
>                     60.94 KB              0
> |  |  |  |  01:SCAN S3                    10    120   75.266ms  417.286ms   
> 12.84M      12.84M   468.96 KB       16.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-POSITION-DELETE-01 
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-position-delete
> |  |  |  00:SCAN S3                       10    120    2s645ms   10s064ms     
>    0       8.64B     4.01 MB       16.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
> |  |  03:SCAN S3                          10    120    2s413ms    4s484ms    
> 1.66B       8.64B   103.32 MB       88.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
>  
> {noformat}
> The planner should account for the fact that each side of the UNION is only 
> scanning a subset of the table (so the UNION cardinality can't be greater 
> than the actual table size) and also if possible try to estimate the impact 
> of the filtering done by the ANTI JOIN.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to