wypoon commented on pull request #3038:
URL: https://github.com/apache/iceberg/pull/3038#issuecomment-906932513


   Here is a part of the optimized logical plan for a TPC-DS query (q18) on V1 
tables:
   ```
   +- Project [c_customer_sk#50, c_current_cdemo_sk#52, c_current_addr_sk#54, c_birth_year#63], Statistics(sizeInBytes=2.7 MiB)
      +- Filter (((c_birth_month#62 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#50)) AND isnotnull(c_current_cdemo_sk#52)) AND isnotnull(c_current_addr_sk#54)), Statistics(sizeInBytes=25.5 MiB)
         +- Relation[c_customer_sk#50,c_customer_id#51,c_current_cdemo_sk#52,c_current_hdemo_sk#53,c_current_addr_sk#54,c_first_shipto_date_sk#55,c_first_sales_date_sk#56,c_salutation#57,c_first_name#58,c_last_name#59,c_preferred_cust_flag#60,c_birth_day#61,c_birth_month#62,c_birth_year#63,c_birth_country#64,c_login#65,c_email_address#66,c_last_review_date#67] parquet, Statistics(sizeInBytes=25.5 MiB)
   ```
   The customer table is 25.5 MiB, but the Project (4 of the customer table's 18 columns) is estimated at 2.7 MiB, which is small enough to be broadcast.
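   For reference, Spark's size-only statistics estimation for a Project scales the child's sizeInBytes by the ratio of the output row width to the child's row width, where widths come from the columns' default data-type sizes plus a fixed per-row overhead (roughly what Catalyst's EstimationUtils does). A simplified sketch with illustrative names, not the actual Spark internals:
   ```scala
   // Simplified sketch of Catalyst's size-only estimation for a Project
   // (illustrative names; not the actual Spark code).
   case class Attr(name: String, defaultSize: Long) // per-value width of the column's type

   def estimateProjectSize(
       childSizeInBytes: BigInt,
       childOutput: Seq[Attr],
       projectOutput: Seq[Attr]): BigInt = {
     // Catalyst adds 8 bytes of per-row overhead when computing row width.
     val childRowWidth   = childOutput.map(_.defaultSize).sum + 8
     val projectRowWidth = projectOutput.map(_.defaultSize).sum + 8
     // Scale the child's size by the ratio of the row widths.
     childSizeInBytes * projectRowWidth / childRowWidth
   }
   ```
   That ratio is why the 25.5 MiB relation shrinks to 2.7 MiB at the Project here: the four projected integer columns account for only a small fraction of the full 18-column row width.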
   Here is the equivalent snippet for Iceberg tables before this change:
   ```
   +- Project [c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_year#72], Statistics(sizeInBytes=21.9 MiB)
      +- Filter (((c_birth_month#71 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#59)) AND isnotnull(c_current_cdemo_sk#61)) AND isnotnull(c_current_addr_sk#63)), Statistics(sizeInBytes=25.5 MiB)
         +- RelationV2[c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_month#71, c_birth_year#72] spark_catalog.tpcds_10_iceberg.customer, Statistics(sizeInBytes=25.5 MiB, rowCount=5.00E+5)
   ```
   A read schema of 5 columns has already been pushed down to the Iceberg customer table, but the statistics reported for it are the full 25.5 MiB of the table. Consequently, the Project, whose size is estimated essentially as the ratio of the size of the 4 projected columns to the size of the 5 read columns times that 25.5 MiB, comes out at 21.9 MiB, which is too big to be broadcast.
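   To make the arithmetic concrete, here is a worked version using the sketch above, under the assumption (which matches the numbers in the plan) that all five read columns are 4-byte integers:
   ```scala
   // Worked numbers for the "before" plan above, assuming all five
   // read columns are 4-byte integers (illustrative).
   val int4      = Attr("int", 4)
   val tableSize = BigInt((25.5 * 1024 * 1024).toLong)   // 25.5 MiB
   val projSize  = estimateProjectSize(tableSize, Seq.fill(5)(int4), Seq.fill(4)(int4))
   // Row widths are 4 * 4 + 8 = 24 and 5 * 4 + 8 = 28 bytes, so
   // projSize = 25.5 MiB * 24 / 28 ≈ 21.9 MiB: above the default 10 MiB
   // spark.sql.autoBroadcastJoinThreshold, hence no broadcast.
   ```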
   With this change, the equivalent snippet becomes:
   ```
   +- Project [c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_year#72], Statistics(sizeInBytes=2.0 MiB)
      +- Filter (((c_birth_month#71 IN (1,6,8,9,12,2) AND isnotnull(c_customer_sk#59)) AND isnotnull(c_current_cdemo_sk#61)) AND isnotnull(c_current_addr_sk#63)), Statistics(sizeInBytes=2.4 MiB)
         +- RelationV2[c_customer_sk#59, c_current_cdemo_sk#61, c_current_addr_sk#63, c_birth_month#71, c_birth_year#72] spark_catalog.tpcds_10_iceberg.customer, Statistics(sizeInBytes=2.4 MiB, rowCount=5.00E+5)
   ```
   Here the statistics for the Iceberg customer table are adjusted according to the read schema and are now 2.4 MiB; the Project estimate scales that by the ratio of the size of the 4 projected columns to the size of the 5 read columns, giving 2.0 MiB, which is small enough to be broadcast.
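   The shape of the fix can be sketched against Spark's DSv2 statistics API. SupportsReportStatistics is Spark's real interface; the PrunedStatsScan class and the width-based scaling below are an illustrative approximation of the idea, not Iceberg's actual code:
   ```scala
   import java.util.OptionalLong

   import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
   import org.apache.spark.sql.types.StructType

   // Report sizeInBytes scaled by the pruned read schema's row width relative
   // to the full table schema, so the optimizer starts from a column-pruned
   // size instead of the whole table's size.
   class PrunedStatsScan(
       fullSchema: StructType,
       prunedSchema: StructType,
       totalSizeInBytes: Long,
       rows: Long) extends SupportsReportStatistics {

     override def readSchema(): StructType = prunedSchema

     override def estimateStatistics(): Statistics = {
       val fullWidth   = fullSchema.fields.map(_.dataType.defaultSize.toLong).sum
       val prunedWidth = prunedSchema.fields.map(_.dataType.defaultSize.toLong).sum
       val scaled = (BigInt(totalSizeInBytes) * prunedWidth / fullWidth).toLong
       new Statistics {
         override def sizeInBytes(): OptionalLong = OptionalLong.of(scaled)
         override def numRows(): OptionalLong = OptionalLong.of(rows)
       }
     }
   }
   ```
   The exact 2.4 MiB figure in the plan comes from Iceberg's own file-size and row-count metadata rather than this width ratio; the point of the sketch is only that estimateStatistics should report a size consistent with the pushed-down read schema.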
   

