SemyonSinchenko commented on PR #46368: URL: https://github.com/apache/spark/pull/46368#issuecomment-2400544337
@zhengruifeng You are absolutely right that such an API is for library developers, not end users. Calling `queryExecution().optimizedPlan.stats.sizeInBytes` via `py4j` is fine in PySpark Classic, but for Spark Connect it is impossible. The size in bytes is very important when choosing between broadcast/collect and distributed processing depending on the upper bound of the size estimate. Without it, PySpark Connect library developers are forced to parse the string representation of the plan... I added this to the Classic API with the sole purpose of providing parity, and I can remove it from Classic entirely if you prefer.
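For illustration, a minimal sketch of what a library developer does today in PySpark Classic, and the kind of size-based decision the estimate feeds into. The py4j call chain follows the comment above; the `choose_strategy` helper and its threshold are hypothetical, not part of any Spark API:

```python
def estimated_size_in_bytes(df):
    """Upper bound of the optimized plan's size estimate, in bytes.

    PySpark Classic only: this reaches through the py4j gateway
    (`df._jdf`), which Spark Connect DataFrames do not expose.
    `sizeInBytes` is a Scala BigInt, so we round-trip via toString.
    """
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    return int(stats.sizeInBytes().toString())


def choose_strategy(size_in_bytes, broadcast_threshold=10 * 1024 * 1024):
    """Pick a processing strategy from the size upper bound.

    The 10 MiB threshold is purely illustrative; a real library
    would tune it to driver/executor memory.
    """
    return "broadcast" if size_in_bytes <= broadcast_threshold else "distributed"
```

Under Spark Connect there is no `_jdf`, which is why the parity API requested in this PR matters.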
