nyaapa opened a new pull request, #53122:
URL: https://github.com/apache/spark/pull/53122

   ### What changes were proposed in this pull request?
   
   - Group multiple keys into one Arrow batch; with high key cardinality this
   generally produces far fewer batches (a minimal sketch of this and the next
   item follows the list).
   - Do not group `init_data` and `input_data` in batch0: instead, serialize
   `init_data` first and then `input_data`. In the worst case this yields one
   extra chunk from not grouping them together, but the Python-side logic
   becomes much simpler.
   - Do not create extra DataFrames when they are not needed, and copy an
   empty one instead.
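   
   To make the intent concrete, here is a minimal, self-contained PyArrow
   sketch of the batching idea. It is not the actual Spark serializer code:
   all names in it (`TARGET_BATCH_SIZE`, `write_grouped`, the toy schema) are
   illustrative. It packs rows from many keys into shared Arrow batches and
   writes `init_data` as its own batch(es) before `input_data`; in the real
   protocol the init/input boundary would be signaled explicitly, which is
   omitted here for brevity.

```python
import io

import pyarrow as pa

TARGET_BATCH_SIZE = 4  # tiny threshold so the toy example emits several batches
SCHEMA = pa.schema([("key", pa.string()), ("value", pa.int64())])


def _flush(writer, keys, values):
    """Emit the buffered rows as a single Arrow record batch."""
    writer.write_batch(pa.record_batch(
        [pa.array(keys, pa.string()), pa.array(values, pa.int64())],
        schema=SCHEMA))


def write_grouped(writer, keyed_rows):
    """Pack rows from many keys into shared batches instead of one batch per key."""
    keys, values = [], []
    for key, vals in keyed_rows:
        keys.extend([key] * len(vals))
        values.extend(vals)
        if len(keys) >= TARGET_BATCH_SIZE:
            _flush(writer, keys, values)
            keys, values = [], []
    if keys:  # flush the tail batch
        _flush(writer, keys, values)


sink = io.BytesIO()
with pa.ipc.new_stream(sink, SCHEMA) as writer:
    # init_data first, as its own batch(es)...
    write_grouped(writer, [("a", [0]), ("b", [1])])
    # ...then input_data, rather than zipping both into batch0.
    write_grouped(writer, [("a", [2, 3]), ("b", [4]), ("c", [5, 6, 7])])

reader = pa.ipc.open_stream(sink.getvalue())
# Two shared batches ([2, 6] rows) instead of five per-key batches.
print([batch.num_rows for batch in reader])
```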
   
   ### Why are the changes needed?
   Benchmark results show that in high-cardinality scenarios this optimization
   improves batch0 time by ~40%, with no visible regression in the
   low-cardinality case.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing unit tests, plus a benchmark with 10,000,000 distinct keys in the
   initial state (8x i3.4xlarge):
       - Without optimization: 11,400 records/s
       - With optimization: 30,000 records/s
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

