[ 
https://issues.apache.org/jira/browse/IMPALA-11787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648083#comment-17648083
 ] 

ASF subversion and git services commented on IMPALA-11787:
----------------------------------------------------------

Commit 33929bfccc995c22890ef8783d5f4671ef30bcae in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=33929bfcc ]

IMPALA-11787, IMPALA-11516: Cardinality estimate for UNION in Iceberg 
position-delete plans can double the actual table cardinality

The plan for Iceberg tables with position-delete files includes a UNION
operator that takes the following inputs:
  LHS: Scan of the data files that don't have corresponding delete files
  RHS: ANTI JOIN that filters the data files that do have corresponding
       delete files based on the content of the delete files.

The planner's cardinality estimates for each of these two inputs to the
UNION can be as large as the full row count of the table (assuming no
other predicates in the scan) and the planner simply sums these in the
UNION which can result in a cardinality estimate for the UNION that's
twice the size of the table.

In this patch IcebergScanNode overrides computeCardinalities() of the
HdfsScanNode. The method is implemented similarly with a few
modifications:

* we exactly know the record counts of the data files
* for table sampling we know the file descriptors, hence the record
  counts as well
* IDENTITY-based partition conjuncts already filtered out the files, so
  we don't need their selectivity

So we calculate the SCAN NODE's cardinalities much more precisely.
This patch also sets the column stats for the virtual columns of the
scan node of the left-hand side of the ANTI JOIN. But because of
IMPALA-11797 the ANTI JOIN's cardinality always equals to the
LHS cardinality. IMPALA-11619 can also resolve this.

Testing:
 * planner tests updated

Change-Id: Ie2927c58c4adfd0ba1e135b63454ac9b07991cbf
Reviewed-on: http://gerrit.cloudera.org:8080/19354
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Cardinality estimate for UNION in Iceberg position-delete plans can double 
> the actual table cardinality
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11787
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11787
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: David Rorke
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>         Attachments: q78_iceberg_update_profile.txt.gz
>
>
> The plan for Iceberg tables with position-delete files includes a UNION 
> operator that takes the following inputs:
>  * LHS: Scan of the data files that don't have corresponding delete files
>  * RHS: ANTI JOIN that filters the data files that do have corresponding 
> delete files based on the content of the delete files.
> The planner's cardinality estimates for each of these two inputs to the UNION 
> can be as large as the full row count of the table (assuming no other 
> predicates in the scan) and the planner simply sums these in the UNION which 
> can result in a cardinality estimate for the UNION that's twice the size of 
> the table.  For example here's an excerpt from the TPC-DS query 78 summary 
> (the total row count for the full store_sales table is 8.64B and the planner 
> estimate of the union cardinality is twice that):
>  
> {noformat}
> Operator                              #Hosts  #Inst   Avg Time   Max Time    
> #Rows  Est. #Rows    Peak Mem  Est. Peak Mem  Detail
> ...
> |  |  04:UNION                            10    120    9.168ms   17.416ms    
> 1.66B      17.28B           0              0
> |  |  |--02:DELETE EVENTS HASH JOIN       10    120  138.821us  615.342us     
>    0       8.64B    42.12 KB              0  LEFT ANTI JOIN, BROADCAST
> |  |  |  |--F22:JOIN BUILD                10     10    5s741ms    5s944ms     
>                      3.09 GB        3.07 GB
> |  |  |  |  35:EXCHANGE                   10     10  212.271ms  264.374ms   
> 12.84M      12.84M    14.98 MB       36.84 MB  BROADCAST
> |  |  |  |  F12:EXCHANGE SENDER           10    120   30.579ms   40.774ms     
>                     60.94 KB              0
> |  |  |  |  01:SCAN S3                    10    120   75.266ms  417.286ms   
> 12.84M      12.84M   468.96 KB       16.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-POSITION-DELETE-01 
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales-position-delete
> |  |  |  00:SCAN S3                       10    120    2s645ms   10s064ms     
>    0       8.64B     4.01 MB       16.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
> |  |  03:SCAN S3                          10    120    2s413ms    4s484ms    
> 1.66B       8.64B   103.32 MB       88.00 MB  
> tpcds_3000_iceberg_parquet_v2_update_q1_98.store_sales
>  
> {noformat}
> The planner should account for the fact that each side of the UNION is only 
> scanning a subset of the table (so the UNION cardinality can't be greater 
> than the actual table size) and also if possible try to estimate the impact 
> of the filtering done by the ANTI JOIN.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to