bryanck opened a new pull request, #5136:
URL: https://github.com/apache/iceberg/pull/5136

   This PR implements the `SupportsReportStatistics` interface in the 
Spark v3.x scan builder classes. In Spark's 
`DataSourceV2Relation.computeStats()`, there is a [case that 
checks](https://github.com/apache/spark/blob/61dc08da34a405a61a77b0e173bf22bb9f11bcfd/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L83)
 whether the scan builder implements `SupportsReportStatistics`, which it currently 
does not. As a result, stats fall back to `conf.defaultSizeInBytes`. 
With this change, stats are reported correctly, which allows systems such as 
EMR to properly apply optimizations such as join reordering.
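
   To illustrate the fallback described above, here is a minimal, self-contained sketch of the type check. The interfaces are simplified stand-ins named after Spark's, not the real Spark API. `PlainScanBuilder`, `ReportingScanBuilder`, `computeSizeInBytes`, and `DEFAULT_SIZE_IN_BYTES` are hypothetical names for illustration, and the fallback value is an assumed placeholder for `conf.defaultSizeInBytes`.

```java
import java.util.OptionalLong;

// Simplified stand-ins for Spark's DataSource V2 types; these are NOT the real Spark interfaces.
class StatsFallbackSketch {

    interface Statistics {
        OptionalLong sizeInBytes();
    }

    interface ScanBuilder { }

    interface SupportsReportStatistics {
        Statistics estimateStatistics();
    }

    // Hypothetical builder without statistics support (the situation before this PR).
    static class PlainScanBuilder implements ScanBuilder { }

    // Hypothetical builder that also reports statistics (the situation after this PR).
    static class ReportingScanBuilder implements ScanBuilder, SupportsReportStatistics {
        private final long sizeInBytes;

        ReportingScanBuilder(long sizeInBytes) {
            this.sizeInBytes = sizeInBytes;
        }

        @Override
        public Statistics estimateStatistics() {
            return () -> OptionalLong.of(sizeInBytes);
        }
    }

    // Placeholder for conf.defaultSizeInBytes; the value is an assumption for illustration.
    static final long DEFAULT_SIZE_IN_BYTES = Long.MAX_VALUE;

    // Mirrors the shape of the check: use reported stats if the builder provides them, else fall back.
    static long computeSizeInBytes(ScanBuilder builder) {
        if (builder instanceof SupportsReportStatistics) {
            Statistics stats = ((SupportsReportStatistics) builder).estimateStatistics();
            return stats.sizeInBytes().orElse(DEFAULT_SIZE_IN_BYTES);
        }
        return DEFAULT_SIZE_IN_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(computeSizeInBytes(new PlainScanBuilder()));
        System.out.println(computeSizeInBytes(new ReportingScanBuilder(1024L)));
    }
}
```

   In this sketch, a builder that does not implement the stand-in interface always yields the default size, which is the behavior this PR is meant to avoid for the Iceberg scan builders.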
   
   Note that the case was [recently 
updated](https://github.com/apache/spark/blob/691b9c70bcaddf494e98617348dd18debd68cadf/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L83)
 in Spark to explicitly call `build()` on the builder, so this change won't be needed 
in future versions of Spark (although it still appears to be needed for v3.3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

