siying opened a new pull request, #42046: URL: https://github.com/apache/spark/pull/42046
### What changes were proposed in this pull request?

Change the serialization format for group-by-with-state outputs: include an explicit hidden column that records how many data records and how many state records there are.

### Why are the changes needed?

The current implementation of ApplyInPandasWithStatePythonRunner cannot handle outputs in which the first column of a row is null. It cannot distinguish a first column that is genuinely null from a field that was filled in only because there are fewer data records than state records. In the former case it produces incorrect results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests that cover the null cases and several other scenarios.
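The ambiguity described under "Why are the changes needed?" can be illustrated with a small, self-contained sketch. This is not Spark's actual Arrow layout or the code in this PR: the helper names (`pad`, `pack_legacy`, `unpack_legacy`, `pack_with_counts`, `unpack_with_counts`) are hypothetical, and the sketch assumes that data and state records share one batch, that the shorter side is padded with all-null rows, and that the old reader treats a null first column as padding.

```python
from typing import List, Tuple

Row = Tuple  # one record; any column, including the first, may be None


def pad(rows: List[Row], width: int, n: int) -> List[Row]:
    """Pad `rows` to length `n` with all-null rows of `width` columns."""
    return list(rows) + [(None,) * width] * (n - len(rows))


def pack_legacy(data, state, data_width, state_width):
    """Assumed old format: data and state side by side, no counts."""
    n = max(len(data), len(state))
    return list(zip(pad(data, data_width, n), pad(state, state_width, n)))


def unpack_legacy(batch):
    """Assumed old reader: a null first column is taken to mean padding."""
    data = [d for d, _ in batch if d[0] is not None]
    state = [s for _, s in batch if s[0] is not None]
    return data, state


def pack_with_counts(data, state, data_width, state_width):
    """Sketch of the fix: carry explicit (data, state) counts in an extra column."""
    counts = (len(data), len(state))
    return [(d, s, counts) for d, s in pack_legacy(data, state, data_width, state_width)]


def unpack_with_counts(batch):
    """Use the counts column rather than null checks to trim the padding."""
    if not batch:
        return [], []
    n_data, n_state = batch[0][2]
    data = [row[0] for row in batch[:n_data]]
    state = [row[1] for row in batch[:n_state]]
    return data, state


# Two data records (the first has a null first column) and one state record.
data = [(None, "a"), (1, "b")]
state = [("k1", 10)]

legacy_data, legacy_state = unpack_legacy(pack_legacy(data, state, 2, 2))
fixed_data, fixed_state = unpack_with_counts(pack_with_counts(data, state, 2, 2))
```

In this sketch `legacy_data` loses the `(None, "a")` record because it looks like padding, while `fixed_data` keeps both records. The counts are repeated on every row purely for simplicity; how the real hidden column is encoded is not stated in the description above.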
