bryanck opened a new pull request, #5136: URL: https://github.com/apache/iceberg/pull/5136
This PR implements the `SupportsReportStatistics` interface in the Spark v3.x scan builder classes. Spark's `DataSourceV2Relation.computeStats()` has a [case that checks](https://github.com/apache/spark/blob/61dc08da34a405a61a77b0e173bf22bb9f11bcfd/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L83) whether the scan builder implements `SupportsReportStatistics`. The Iceberg scan builders currently do not, so stats fall back to `conf.defaultSizeInBytes`. With this change, stats are reported correctly, which allows systems such as EMR to properly apply optimizations such as join reordering.

Note that the case was [recently updated](https://github.com/apache/spark/blob/691b9c70bcaddf494e98617348dd18debd68cadf/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L83) in Spark to explicitly call `build()` on the builder, so this change won't be needed in future versions of Spark (but it still appears to be needed for v3.3).
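For readers unfamiliar with the check described above, here is a minimal, simplified sketch of the fallback behavior. It does not use Spark or Iceberg classes: `ReportsStatistics`, `PlainScanBuilder`, `ReportingScanBuilder`, and `computeSizeInBytes` are hypothetical stand-ins, and the default size passed in is an arbitrary illustrative value rather than Spark's actual configuration default.

```java
import java.util.OptionalLong;

// Simplified model of the type check in computeStats(). All names here are
// hypothetical stand-ins, not the real Spark or Iceberg classes.
class StatsFallbackSketch {

    // Stand-in for Spark's SupportsReportStatistics interface.
    interface ReportsStatistics {
        OptionalLong sizeInBytes();
    }

    // Stand-in for a scan builder that does not report statistics.
    static class PlainScanBuilder {
    }

    // Stand-in for a scan builder that also reports statistics.
    static class ReportingScanBuilder extends PlainScanBuilder implements ReportsStatistics {
        private final long size;

        ReportingScanBuilder(long size) {
            this.size = size;
        }

        @Override
        public OptionalLong sizeInBytes() {
            return OptionalLong.of(size);
        }
    }

    // Use the reported size when the builder exposes statistics;
    // otherwise fall back to the supplied default.
    static long computeSizeInBytes(Object scanBuilder, long defaultSizeInBytes) {
        if (scanBuilder instanceof ReportsStatistics) {
            return ((ReportsStatistics) scanBuilder).sizeInBytes().orElse(defaultSizeInBytes);
        }
        return defaultSizeInBytes;
    }

    public static void main(String[] args) {
        long assumedDefault = 1_000_000L; // arbitrary value for illustration

        System.out.println("plain builder:     "
            + computeSizeInBytes(new PlainScanBuilder(), assumedDefault));
        System.out.println("reporting builder: "
            + computeSizeInBytes(new ReportingScanBuilder(4_096L), assumedDefault));
    }
}
```

In this sketch the plain builder always yields the default, which is the situation the PR description attributes to the scan builders before the change. The reporting builder yields its own estimate, which is what a planner needs for size-based decisions such as join reordering.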
