wypoon opened a new pull request #3038: URL: https://github.com/apache/iceberg/pull/3038
When estimating statistics, we should take the read schema into account. As explained in the [PR](https://github.com/apache/spark/pull/33825) for [SPARK-36568](https://issues.apache.org/jira/browse/SPARK-36568): `V2ScanRelationPushDown` can column-prune `DataSourceV2ScanRelation`s and change the read schema of `Scan` operations.

Before this change, `SparkBatchScan.estimateStatistics()` sums the sizes of the files to be scanned and uses that sum as the estimate for the size in bytes. With this change, we scale that sum by the ratio of the size of the columns to be read to the size of all the columns; we also scale it by the compression factor if one is set (the default is 1.0).

As a result, some joins that are broadcast joins on V1 tables but SortMergeJoins on Iceberg tables can become broadcast joins on Iceberg tables as well. Basically, this change does for Iceberg what the above-mentioned Spark PR does for the built-in V2 `FileScan`s.
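The adjustment described above can be sketched as plain arithmetic. This is a minimal illustration in Python, not the code in this PR: the `Column` type, the `estimate_size_in_bytes` function, and the per-column `default_size` widths are all hypothetical names and values chosen for the example, and how Iceberg or Spark actually derives a column's size is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor used only for this sketch."""
    name: str
    default_size: int  # assumed per-value width in bytes (illustrative)


def estimate_size_in_bytes(
    file_sizes: Iterable[int],
    table_columns: Sequence[Column],
    read_columns: Iterable[str],
    compression_factor: float = 1.0,
) -> int:
    """Scale the summed file sizes by the share of column width being read."""
    total_file_bytes = sum(file_sizes)
    full_width = sum(c.default_size for c in table_columns)
    if full_width == 0:
        # No width information: fall back to the unpruned estimate.
        return int(total_file_bytes * compression_factor)
    wanted = set(read_columns)
    read_width = sum(c.default_size for c in table_columns if c.name in wanted)
    return int(total_file_bytes * compression_factor * read_width / full_width)


# Illustrative table: three columns with made-up widths summing to 100 bytes.
TABLE_COLUMNS = [
    Column("id", 8),
    Column("name", 20),
    Column("payload", 72),
]
FILE_SIZES = [600, 400]  # two data files, 1000 bytes in total

# Reading every column leaves the old estimate unchanged.
full_scan = estimate_size_in_bytes(FILE_SIZES, TABLE_COLUMNS, ["id", "name", "payload"])
# Reading only id and name keeps 28/100 of the summed file size.
pruned_scan = estimate_size_in_bytes(FILE_SIZES, TABLE_COLUMNS, ["id", "name"])
```

With these made-up numbers, the pruned scan is estimated at 280 bytes instead of 1000. A smaller estimate of this kind is what can let the planner choose a broadcast join where the unadjusted file-size sum would have been too large.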
