SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2400544337
@zhengruifeng You are absolutely right that such an API is for library developers, not end users. Calling `queryExecution().optimizedPlan.stats.sizeInBytes` via `py4j` is fine in PySpark Classic, but for Spark Connect it is impossible. The size in bytes is very important when choosing between broadcast/collect and distributed processing depending on the upper bound of the size estimate. Without it, PySpark Connect library developers are forced to parse the string representation of the plan... I added this to the Classic API with the sole purpose of providing parity, and I can remove it from Classic entirely if you prefer.
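For illustration, a minimal sketch of what a library developer does today in PySpark Classic, and the kind of size-based decision the estimate feeds into. The py4j call chain follows the comment above; the `choose_strategy` helper and its threshold are hypothetical, not part of any Spark API:

```python
def estimated_size_in_bytes(df):
    """Upper bound of the optimized plan's size estimate, in bytes.

    PySpark Classic only: this reaches through the py4j gateway
    (`df._jdf`), which Spark Connect DataFrames do not expose.
    `sizeInBytes` is a Scala BigInt, so we round-trip via toString.
    """
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    return int(stats.sizeInBytes().toString())


def choose_strategy(size_in_bytes, broadcast_threshold=10 * 1024 * 1024):
    """Pick a processing strategy from the size upper bound.

    The 10 MiB threshold is purely illustrative; a real library
    would tune it to driver/executor memory.
    """
    return "broadcast" if size_in_bytes <= broadcast_threshold else "distributed"
```

Under Spark Connect there is no `_jdf`, which is why the parity API requested in this PR matters.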
