[GitHub] [spark] HyukjinKwon opened a new pull request, #41563: [SPARK-43798][PYTHON][FOLLOW-UP] Sets the batch size as 1 to send individual group

via GitHub Mon, 12 Jun 2023 21:00:10 -0700


HyukjinKwon opened a new pull request, #41563:
URL: https://github.com/apache/spark/pull/41563


   ### What changes were proposed in this pull request?
   
   This PR is a follow-up to https://github.com/apache/spark/pull/41316 that 
sets the correct batch size, 1, when streaming from Python to the JVM.
   
   In addition, this PR contains some minor cleanup.
   
   ### Why are the changes needed?
   
   Currently, we send 100 groups per batch instead of 100 rows, meaning 
that a single batch carries 100 iterators from 100 UDTF invocations with 
different arguments. This is because one "row" in `BatchedSerializer` is 
actually one group from one UDTF invocation within a tuple. This PR sets the 
batch size to 1 for now.
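   
   To illustrate the difference, here is a minimal pure-Python sketch. It 
does not use Spark's actual `BatchedSerializer`; the `batched` and 
`udtf_invocation` helpers are hypothetical stand-ins that only mimic the idea 
of grouping serialized items into fixed-size batches, where each item is one 
whole group rather than one row.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to `batch_size` items (a simplified stand-in for a
    batching serializer; not Spark's implementation)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def udtf_invocation(arg):
    # Hypothetical UDTF: emits `arg` output rows for one input argument.
    return [(arg, i) for i in range(arg)]


# Each item handed to the batcher is one whole group, that is, all rows
# produced by a single UDTF invocation.
groups = [udtf_invocation(n) for n in (2, 3, 1)]

# With a batch size of 100, all three groups land in a single batch.
batches_100 = list(batched(groups, 100))

# With a batch size of 1, each group is sent on its own.
batches_1 = list(batched(groups, 1))
```

   With a batch size of 100, the receiver gets the output of several 
invocations in one batch; with a batch size of 1, each batch corresponds to 
exactly one invocation.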
   
   Ideally, we should implement custom logic to batch them by number of rows. 
However, this requires defining an additional protocol (e.g., to signal the 
end of one UDTF invocation). Therefore, this PR fixes the issue with a 
minimal change for now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, UDTFs have not been released to end users yet.
   
   ### How was this patch tested?
   
   Manually tested.
   


