[GitHub] [spark] siying opened a new pull request, #42046: [SPARK-40434][SS] Implement applyInPandasWithState in PySpark

via GitHub Mon, 17 Jul 2023 17:33:29 -0700


siying opened a new pull request, #42046:
URL: https://github.com/apache/spark/pull/42046


   ### What changes were proposed in this pull request?
   Change the serialization format of group-by-with-state outputs to include an 
explicit hidden column that records the number of data records and the number 
of state records.
   
   ### Why are the changes needed?
   The current implementation of ApplyInPandasWithStatePythonRunner cannot 
handle outputs in which the first column of a row is null. It cannot 
distinguish a column that is genuinely null from a field that was filled in 
only because there are fewer data records than state records. In the former 
case it produces incorrect results.
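
The ambiguity can be illustrated with a small, self-contained Python sketch. It does not use Spark or Arrow, and every name in it (`PAD`, `pack_without_counts`, `pack_with_counts`, and so on) is hypothetical. Carrying the two counts next to the batch only stands in for the hidden count column described above; it is not the actual serialization format.

```python
from typing import List, Optional, Tuple

# A row is modeled as a tuple of nullable columns; None stands in for SQL NULL.
Row = Tuple[Optional[int], ...]
PAD: Row = (None, None)  # hypothetical filler row used to equalize lengths


def pad(rows: List[Row], n: int) -> List[Row]:
    """Append filler rows until the list has n entries."""
    return rows + [PAD] * (n - len(rows))


def pack_without_counts(data: List[Row], state: List[Row]):
    """Zip data and state rows into one batch, padding the shorter side."""
    n = max(len(data), len(state))
    return list(zip(pad(data, n), pad(state, n)))


def unpack_without_counts(batch):
    """Guess that a row is filler whenever its first column is null."""
    data = [d for d, _ in batch if d[0] is not None]
    state = [s for _, s in batch if s[0] is not None]
    return data, state


def pack_with_counts(data: List[Row], state: List[Row]):
    """Same batch, but with the record counts carried explicitly."""
    return len(data), len(state), pack_without_counts(data, state)


def unpack_with_counts(packed):
    """Slice by the explicit counts instead of inspecting null values."""
    n_data, n_state, batch = packed
    data = [d for d, _ in batch[:n_data]]
    state = [s for _, s in batch[:n_state]]
    return data, state


# Two data rows (the first has a null first column) and three state rows.
data_rows = [(None, 7), (1, 8)]
state_rows = [(10, 0), (11, 0), (12, 0)]

guessed_data, _ = unpack_without_counts(pack_without_counts(data_rows, state_rows))
exact_data, exact_state = unpack_with_counts(pack_with_counts(data_rows, state_rows))
```

In this sketch, `guessed_data` loses the legitimate row `(None, 7)` because it looks like filler, while `exact_data` and `exact_state` match the inputs because the reader no longer has to infer anything from null values.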
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added unit tests that cover the null cases and several other scenarios.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

