David Rorke created IMPALA-11787:
------------------------------------
Summary: Cardinality estimate for UNION in Iceberg position-delete
plans can double the actual table cardinality
Key: IMPALA-11787
URL: https://issues.apache.org/jira/browse/IMPALA-11787
Project: IMPALA
Issue Type: Bug
Components: Frontend
Reporter: David Rorke
Assignee: Zoltán Borók-Nagy
Attachments: q78_iceberg_update_profile.txt.gz
The plan for Iceberg tables with position-delete files includes a UNION
operator that takes the following inputs:
* LHS: Scan of the data files for the tale that don't have corresponding delete
files
* RHS: ANTI JOIN that takes filters the data files that do have corresponding
delete files based on the content of the delete files.
The planner's cardinality estimates for each of these two inputs to the UNION
can be as large as the full row count of the table (assuming no other
predicates in the scan) and the planner simply sums these in the UNION which
can result in a cardinality estimate for the UNION that's twice the size of the
table. For example here's an excerpt from the TPC-DS query 78 summary (the
total row count for the full store_sales table is 8.64B and the planner
estimate of the union cardinality is twice that):
{code:java}
Operator #Hosts #Inst Avg Time Max Time
#Rows Est. #Rows Peak Mem Est. Peak Mem Detail
...
| | 04:UNION 10 120 9.168ms 17.416ms
1.66B 17.28B 0 0
| | |--02:DELETE EVENTS HASH JOIN 10 120 138.821us 615.342us
0 8.64B 42.12 KB 0 LEFT ANTI JOIN, BROADCAST
| | | |--F22:JOIN BUILD 10 10 5s741ms 5s944ms
3.09 GB 3.07 GB
| | | | 35:EXCHANGE 10 10 212.271ms 264.374ms
12.84M 12.84M 14.98 MB 36.84 MB BROADCAST
| | | | F12:EXCHANGE SENDER 10 120 30.579ms 40.774ms
60.94 KB 0
| | | | 01:SCAN S3 10 120 75.266ms 417.286ms
12.84M 12.84M 468.96 KB 16.00 MB
tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-POSITION-DELETE-01
tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-position-delete
| | | 00:SCAN S3 10 120 2s645ms 10s064ms
0 8.64B 4.01 MB 16.00 MB
tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
| | 03:SCAN S3 10 120 2s413ms 4s484ms
1.66B 8.64B 103.32 MB 88.00 MB
tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
{code}
The planner should account for the fact that each side of the UNION is only
scanning a subset of the table (so the UNION cardinality can't be greater than
the actual table size) and also if possible try to estimate the impact of the
filtering done by the ANTI JOIN.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)