David Rorke created IMPALA-11787:
------------------------------------

             Summary: Cardinality estimate for UNION in Iceberg position-delete 
plans can double the actual table cardinality
                 Key: IMPALA-11787
                 URL: https://issues.apache.org/jira/browse/IMPALA-11787
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: David Rorke
            Assignee: Zoltán Borók-Nagy
         Attachments: q78_iceberg_update_profile.txt.gz

The plan for Iceberg tables with position-delete files includes a UNION 
operator that takes the following inputs:

* LHS: Scan of the data files for the table that don't have corresponding 
delete files
* RHS: ANTI JOIN that filters the data files that do have corresponding 
delete files, based on the content of the delete files.

The planner's cardinality estimate for each of these two inputs can be as 
large as the full row count of the table (assuming no other predicates in the 
scan). The planner simply sums the two estimates in the UNION, which can 
produce a UNION cardinality estimate that is twice the size of the table. For 
example, here is an excerpt from the TPC-DS query 78 summary (the total row 
count of the full store_sales table is 8.64B, and the planner's estimate of 
the UNION cardinality is twice that, 17.28B):

{code:java}
Operator                              #Hosts  #Inst   Avg Time   Max Time    #Rows  Est. #Rows    Peak Mem  Est. Peak Mem  Detail
...
|  |  04:UNION                            10    120    9.168ms   17.416ms    1.66B      17.28B           0              0
|  |  |--02:DELETE EVENTS HASH JOIN       10    120  138.821us  615.342us        0       8.64B    42.12 KB              0  LEFT ANTI JOIN, BROADCAST
|  |  |  |--F22:JOIN BUILD                10     10    5s741ms    5s944ms                          3.09 GB        3.07 GB
|  |  |  |  35:EXCHANGE                   10     10  212.271ms  264.374ms   12.84M      12.84M    14.98 MB       36.84 MB  BROADCAST
|  |  |  |  F12:EXCHANGE SENDER           10    120   30.579ms   40.774ms                         60.94 KB              0
|  |  |  |  01:SCAN S3                    10    120   75.266ms  417.286ms   12.84M      12.84M   468.96 KB       16.00 MB  tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-POSITION-DELETE-01 tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-position-delete
|  |  |  00:SCAN S3                       10    120    2s645ms   10s064ms        0       8.64B     4.01 MB       16.00 MB  tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
|  |  03:SCAN S3                          10    120    2s413ms    4s484ms    1.66B       8.64B   103.32 MB       88.00 MB  tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
 
{code}
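
The arithmetic behind the 17.28B figure can be reproduced with a small 
sketch. This is not Impala's planner code; the function name 
naive_union_cardinality is hypothetical, and the only inputs are the 
"Est. #Rows" values shown in the excerpt above.

```python
# Illustrative only: reproduces the summing behaviour described in this
# report using the "Est. #Rows" values from the query summary excerpt.
TABLE_ROWS = 8_640_000_000        # total row count of store_sales
ACTUAL_UNION_ROWS = 1_660_000_000  # "#Rows" reported for 04:UNION


def naive_union_cardinality(input_estimates):
    """Sum the child estimates with no upper bound (hypothetical helper)."""
    return sum(input_estimates)


lhs_scan_est = 8_640_000_000       # 03:SCAN S3 (files without delete files)
rhs_anti_join_est = 8_640_000_000  # 02:DELETE EVENTS HASH JOIN

union_est = naive_union_cardinality([lhs_scan_est, rhs_anti_join_est])

print(f"UNION estimate:       {union_est:,}")
print(f"Ratio to table size:  {union_est / TABLE_ROWS:.2f}")
print(f"Ratio to actual rows: {union_est / ACTUAL_UNION_ROWS:.2f}")
```

Both inputs carry the full-table estimate, so the sum is exactly twice the 
table row count even though the two inputs read disjoint sets of data files.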

The planner should account for the fact that each side of the UNION scans 
only a subset of the table, so the UNION cardinality can't be greater than the 
actual table size. If possible, it should also estimate the impact of the 
filtering done by the ANTI JOIN.
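
One possible shape for such an estimate is sketched below. It is only an 
illustration of the bound described above, not a proposed patch: the function 
bounded_union_cardinality and its parameters are hypothetical, and the 
ANTI JOIN adjustment assumes that every position-delete record removes 
exactly one distinct row, which may overstate the effect of the deletes.

```python
def bounded_union_cardinality(lhs_est, rhs_est, table_rows, delete_rows=0):
    """Hypothetical estimate for the UNION over the two Iceberg inputs.

    lhs_est:     estimate for the scan of data files without delete files
    rhs_est:     estimate for the ANTI JOIN input (data files with deletes)
    table_rows:  total row count of the table
    delete_rows: estimated number of position-delete records (assumption:
                 each one removes exactly one distinct row)
    """
    # The two inputs read disjoint subsets of the table, so their combined
    # output can never exceed the table's row count.
    capped = min(lhs_est + rhs_est, table_rows)
    # Rough ANTI JOIN adjustment under the one-delete-one-row assumption.
    return max(capped - delete_rows, 0)


table_rows = 8_640_000_000
print(bounded_union_cardinality(table_rows, table_rows, table_rows))
print(bounded_union_cardinality(table_rows, table_rows, table_rows,
                                delete_rows=12_840_000))
```

With the numbers from the excerpt, the cap alone brings the estimate down 
from 17.28B to 8.64B, and subtracting the 12.84M delete records lowers it 
slightly further.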



