wypoon commented on pull request #3038:
URL: https://github.com/apache/iceberg/pull/3038#issuecomment-906932513
Here is a part of the optimized logical plan for a TPC-DS query (q18) on V1
tables:
```
+- Project [c_customer_sk#50, c_current_cdemo_sk#52, c_current_addr_sk#54, c_birth_year#63], Statistics(sizeInBytes=2.7 MiB)
   +- Filter (((c_birth_month#62 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#50)) AND isnotnull(c_current_cdemo_sk#52)) AND isnotnull(c_current_addr_sk#54)), Statistics(sizeInBytes=25.5 MiB)
      +- Relation[c_customer_sk#50,c_customer_id#51,c_current_cdemo_sk#52,c_current_hdemo_sk#53,c_current_addr_sk#54,c_first_shipto_date_sk#55,c_first_sales_date_sk#56,c_salutation#57,c_first_name#58,c_last_name#59,c_preferred_cust_flag#60,c_birth_day#61,c_birth_month#62,c_birth_year#63,c_birth_country#64,c_login#65,c_email_address#66,c_last_review_date#67] parquet, Statistics(sizeInBytes=25.5 MiB)
```
The customer table is 25.5 MiB, but the Project (4 of the customer table's 18 columns) is estimated at only 2.7 MiB, small enough to be broadcast.
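For reference, Spark's size-only estimation (EstimationUtils.getSizePerRow in Catalyst) costs a row as an 8-byte overhead plus each output column's defaultSize (4 for an int, 20 for a string), and scales the child's sizeInBytes by the ratio of output row width to child row width. A back-of-the-envelope check in Scala, assuming the 18-column customer schema breaks down as 9 int and 9 string columns:
```scala
// Rough reconstruction of Spark's size-only Project estimate for the V1 plan.
// Assumption: the 18 customer columns are 9 ints (defaultSize 4) and
// 9 strings (defaultSize 20); each row also carries an 8-byte overhead.
val childRowWidth  = 8 + 9 * 4 + 9 * 20  // full 18-column row: 224 bytes
val outputRowWidth = 8 + 4 * 4           // the 4 projected int columns: 24 bytes
val projectSizeMiB = 25.5 * outputRowWidth / childRowWidth
println(f"$projectSizeMiB%.1f MiB")      // ≈ 2.7 MiB, matching the plan above
```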
Here is the equivalent snippet for Iceberg tables before this change:
```
+- Project [c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_year#72], Statistics(sizeInBytes=21.9 MiB)
   +- Filter (((c_birth_month#71 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#59)) AND isnotnull(c_current_cdemo_sk#61)) AND isnotnull(c_current_addr_sk#63)), Statistics(sizeInBytes=25.5 MiB)
      +- RelationV2[c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_month#71, c_birth_year#72] spark_catalog.tpcds_10_iceberg.customer, Statistics(sizeInBytes=25.5 MiB, rowCount=5.00E+5)
```
A read schema of 5 columns has already been pushed down to the Iceberg customer table, but the reported statistics are still the full 25.5 MiB of the table. Consequently, the Project estimate, obtained essentially by multiplying that 25.5 MiB by the ratio of the size of the 4 projected columns to the size of the 5 read columns, comes to 21.9 MiB, too big to be broadcast.
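The same row-width arithmetic reproduces the 21.9 MiB: all 5 read columns and all 4 projected columns are 4-byte ints, so the ratio is 24/28:
```scala
// Pre-change Iceberg case: the relation still reports the full table size,
// so the Project estimate is 25.5 MiB scaled by the 4-of-5-int-columns ratio.
val ratio = (8 + 4 * 4).toDouble / (8 + 5 * 4)  // 24/28
println(f"${25.5 * ratio}%.1f MiB")             // ≈ 21.9 MiB, too big to broadcast
```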
With this change, the equivalent snippet becomes:
```
+- Project [c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_year#72], Statistics(sizeInBytes=2.0 MiB)
   +- Filter (((c_birth_month#71 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#59)) AND isnotnull(c_current_cdemo_sk#61)) AND isnotnull(c_current_addr_sk#63)), Statistics(sizeInBytes=2.4 MiB)
      +- RelationV2[c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_month#71, c_birth_year#72] spark_catalog.tpcds_10_iceberg.customer, Statistics(sizeInBytes=2.4 MiB, rowCount=5.00E+5)
```
Here the statistics for the Iceberg customer table are adjusted for the read schema and are now 2.4 MiB, and the Project estimate scales that by the ratio of the size of the 4 projected columns to the size of the 5 read columns, giving 2.0 MiB, so it can be broadcast.
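The post-change numbers are consistent with the same formula (the small gap between the plan's 2.0 MiB and the 2.06 MiB below is just display rounding of the underlying byte counts):
```scala
// Post-change check: the relation now reports ~2.4 MiB for the 5-column
// read schema, and the Project scales it by the same 24/28 ratio.
val projectSizeMiB = 2.4 * (8 + 4 * 4) / (8 + 5 * 4).toDouble
println(f"$projectSizeMiB%.2f MiB")  // ≈ 2.06 MiB; the plan shows 2.0 MiB
                                     // since both figures are rounded displays
```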