[GitHub] [iceberg] wypoon opened a new pull request #3038: Spark: Better SparkBatchScan statistics estimation

GitBox Thu, 26 Aug 2021 18:49:58 -0700


wypoon opened a new pull request #3038:
URL: https://github.com/apache/iceberg/pull/3038



   When estimating statistics, we should take the read schema into account.
   As explained in the [PR](https://github.com/apache/spark/pull/33825) for 
[SPARK-36568](https://issues.apache.org/jira/browse/SPARK-36568):
   `V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and 
change read schema of `Scan` operations. 
   In `SparkBatchScan.estimateStatistics()`, before this change, we sum the 
file sizes of the files to be scanned and use that as the estimate for the size 
in bytes. With this change, we adjust that by the ratio of the size of the 
columns to be read to the size of all the columns; we also adjust that by the 
compression factor in case it is set (the default is 1.0). With this change, 
some joins that are broadcast joins on V1 tables but are SortMergeJoins on 
Iceberg tables can be broadcast joins as well.
   Basically this change does for Iceberg what the above-mentioned Spark PR 
does for the built-in V2 `FileScan`s.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] wypoon opened a new pull request #3038: Spark: Better SparkBatchScan statistics estimation

Reply via email to