[PR] [SPARK-55170][PYTHON] Extract grouped stream reading pattern from serializers [spark]

via GitHub Sat, 24 Jan 2026 14:13:51 -0800


Yicong-Huang opened a new pull request, #53953:
URL: https://github.com/apache/spark/pull/53953


   ### What changes were proposed in this pull request?
   
   This PR extracts the common grouped stream reading pattern from multiple serializers into a reusable method `_load_dataframe_groups` in `ArrowStreamSerializer`.
   
   Multiple serializers previously duplicated the same pattern for reading grouped data from the stream:
   - `GroupArrowUDFSerializer`
   - `ArrowStreamAggArrowUDFSerializer`
   - `ArrowStreamAggPandasUDFSerializer`
   - `GroupPandasUDFSerializer`
   - `CogroupArrowUDFSerializer`
   - `CogroupPandasUDFSerializer`
   
   The duplicated pattern was:
   ```python
   dataframes_in_group = None
   while dataframes_in_group is None or dataframes_in_group > 0:
       dataframes_in_group = read_int(stream)
       if dataframes_in_group == EXPECTED_COUNT:
           # process group
           yield result
       elif dataframes_in_group != 0:
           raise PySparkValueError(
               errorClass="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
               messageParameters={"dataframes_in_group": str(dataframes_in_group)},
           )
   ```
   
   This has been extracted into a method in `ArrowStreamSerializer`:
   ```python
   def _load_dataframe_groups(self, stream, num_dataframes: int = 1):
       """
       Load groups with specified number of dataframes from stream.
       Yields a tuple of iterators of RecordBatches for each group.
       """
       ...
   ```
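
To make the shape of the extraction concrete, here is a minimal, self-contained sketch of how such a helper could work. It is not the code from this PR: it does not depend on PySpark or PyArrow, `read_int` and `write_int` are local stand-ins, the length-prefixed byte payloads and the `read_payload` helper are hypothetical substitutes for Arrow RecordBatch streams, and a plain `ValueError` stands in for `PySparkValueError`. Unlike the docstring above, which describes yielding iterators of RecordBatches, this sketch reads each payload eagerly.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def read_int(stream: BinaryIO) -> int:
    """Read a big-endian 4-byte signed int (local stand-in, not PySpark's)."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended while reading an int")
    return struct.unpack("!i", data)[0]


def write_int(value: int, stream: BinaryIO) -> None:
    """Write a big-endian 4-byte signed int (used only to build test input)."""
    stream.write(struct.pack("!i", value))


def read_payload(stream: BinaryIO) -> bytes:
    """Hypothetical stand-in for reading one dataframe: a length-prefixed blob."""
    length = read_int(stream)
    return stream.read(length)


def load_dataframe_groups(
    stream: BinaryIO, num_dataframes: int = 1
) -> Iterator[Tuple[bytes, ...]]:
    """Yield one tuple of `num_dataframes` payloads per group until a 0 marker."""
    dataframes_in_group = None
    while dataframes_in_group is None or dataframes_in_group > 0:
        dataframes_in_group = read_int(stream)
        if dataframes_in_group == num_dataframes:
            # Expected count: read exactly that many dataframes for this group.
            yield tuple(read_payload(stream) for _ in range(num_dataframes))
        elif dataframes_in_group != 0:
            # Any other non-zero count is a protocol error.
            raise ValueError(
                f"Invalid number of dataframes in group: {dataframes_in_group}"
            )


# Usage: two cogroups (two dataframes each), followed by the 0 end marker.
buf = io.BytesIO()
for left, right in [(b"a1", b"b1"), (b"a2", b"b2")]:
    write_int(2, buf)
    for payload in (left, right):
        write_int(len(payload), buf)
        buf.write(payload)
write_int(0, buf)
buf.seek(0)

groups = list(load_dataframe_groups(buf, num_dataframes=2))
```

With a helper of this shape, a grouped serializer would call it with `num_dataframes=1` and a cogrouped one with `num_dataframes=2`, so the count check and the error path live in one place.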
   
   ### Why are the changes needed?
   
   This is part of Phase 2: Reduce serializer complexity (SPARK-55159). The change:
   1. Reduces code duplication across 6 serializers (~35 lines net reduction)
   2. Makes the code easier to maintain and understand
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


