zeruibao opened a new pull request, #52331: URL: https://github.com/apache/spark/pull/52331
### What changes were proposed in this pull request?

This PR introduces an optimization to JVM–Python communication in TransformWithState (TWS) by allowing multiple keys to be grouped into a single Arrow batch.

Currently, each Arrow batch is restricted to contain records for a single key. In high-cardinality scenarios, this results in many small Arrow batches (e.g., [(key1, value1), (key1, value2)], [(key2, value1), (key2, value2)]), which increases the overhead of Arrow batch transmission between the JVM and Python.

With this change, records with different keys can be bin-packed into the same Arrow batch, reducing the number of batches transmitted. On the Python side, we leverage groupBy to regroup records by key, mirroring the behavior of the Scala GroupedIterator implementation.

This PR only handles `TransformWithStateInPySparkPythonRunner`. `TransformWithStateInPySparkPythonInitialStateRunner` only affects batch 0, so it is left to a follow-up PR.

This approach significantly reduces transmission overhead while preserving correct grouping semantics.

### Why are the changes needed?

Benchmark results show that in high-cardinality scenarios, this optimization improves throughput by ~20% by reducing the overhead of Arrow batch transmission. For low-cardinality scenarios, the change introduces no observable regression.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests and benchmarks.

### Was this patch authored or co-authored using generative AI tooling?

No
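
For readers unfamiliar with the regrouping step described under "What changes were proposed", the following is a minimal, self-contained sketch of the idea rather than the code in this PR. It uses plain Python lists in place of Arrow batches, and the names `bin_pack`, `regroup_by_key`, and `max_records_per_batch` are hypothetical. It also assumes that records for the same key arrive contiguously, since `itertools.groupby` only merges adjacent records with equal keys.

```python
from itertools import chain, groupby
from typing import Any, Iterable, Iterator, List, Tuple

# A (key, value) pair standing in for one row of an Arrow batch (assumption).
Record = Tuple[Any, Any]


def bin_pack(records: Iterable[Record], max_records_per_batch: int) -> Iterator[List[Record]]:
    """Pack records into fixed-size batches regardless of key boundaries."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) >= max_records_per_batch:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def regroup_by_key(batches: Iterable[List[Record]]) -> Iterator[Tuple[Any, List[Any]]]:
    """Rebuild per-key groups from batches that may mix several keys.

    Batches are flattened into one row stream first, so a key whose records
    straddle two batches still ends up in a single group.
    """
    rows = chain.from_iterable(batches)
    for key, group in groupby(rows, key=lambda record: record[0]):
        yield key, [value for _, value in group]


records = [
    ("key1", "value1"), ("key1", "value2"),
    ("key2", "value1"), ("key2", "value2"),
    ("key3", "value1"),
]
# With a batch size of 3, key2 is split across the two batches.
batches = list(bin_pack(records, max_records_per_batch=3))
groups = list(regroup_by_key(batches))
```

In this sketch, five records travel in two batches instead of three single-key batches, and `groups` still contains exactly one entry per key. Flattening before grouping is one way to handle keys that cross a batch boundary; the actual runner may handle that case differently.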
