[PR] [SPARK-54316][PYTHON] Consolidate GroupPandasIterUDFSerializer with GroupPandasUDFSerializer [spark]

via GitHub Thu, 13 Nov 2025 11:58:41 -0800


Yicong-Huang opened a new pull request, #53043:
URL: https://github.com/apache/spark/pull/53043


   ### What changes were proposed in this pull request?
   
   This PR consolidates `GroupPandasIterUDFSerializer` with 
`GroupPandasUDFSerializer` to eliminate code duplication and improve 
maintainability.
   
   **Key changes:**
   
   **Modified `GroupPandasUDFSerializer`** 
(`python/pyspark/sql/pandas/serializers.py`):
      - Added `use_iterator` parameter to support both regular and iterator 
modes
      - Extracted common batch-to-pandas conversion logic into 
`_convert_batches_to_pandas` helper method
      - Unified `load_stream()` and `dump_stream()` so that both modes share a 
single code path, branching only where batches are grouped
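
A rough sketch of the consolidation pattern described above, not the actual pyspark code. `ToyGroupSerializer`, `_convert_batches`, and the list-of-ints "batches" are hypothetical stand-ins for illustration. Only the `use_iterator` flag and the idea of a shared conversion helper come from this PR description.

```python
from typing import Iterable, Iterator, List, Union

# Hypothetical stand-in for an Arrow record batch: a plain list of ints.
Batch = List[int]


class ToyGroupSerializer:
    """Illustrative only; this is not pyspark's GroupPandasUDFSerializer."""

    def __init__(self, use_iterator: bool = False) -> None:
        # Same flag name as in the PR description; everything else is assumed.
        self.use_iterator = use_iterator

    @staticmethod
    def _convert_batches(batches: Iterable[Batch]) -> Iterator[Batch]:
        # Shared conversion step used by both modes. It is a generator,
        # so each batch is converted only when the consumer asks for it.
        for batch in batches:
            yield [value * 10 for value in batch]

    def load_stream(
        self, groups: Iterable[Iterable[Batch]]
    ) -> Iterator[Union[Iterator[Batch], Batch]]:
        for group in groups:
            converted = self._convert_batches(group)
            if self.use_iterator:
                # Iterator mode: hand the lazy per-batch iterator to the caller.
                yield converted
            else:
                # Regular mode: materialize the whole group as one merged unit.
                yield [value for batch in converted for value in batch]


groups = [[[1, 2], [3]], [[4]]]

regular = list(ToyGroupSerializer(use_iterator=False).load_stream(groups))
lazy = [list(g) for g in ToyGroupSerializer(use_iterator=True).load_stream(groups)]

print(regular)
print(lazy)
```

In this sketch the only branch is the final grouping step. The conversion helper is written once and reused by both modes.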
   
   ### Why are the changes needed?
   
   When the `Iterator[pandas.DataFrame]` API was added to 
`groupBy().applyInPandas()` in SPARK-53614 (#52716), a new 
`GroupPandasIterUDFSerializer` class was created. However, this class is nearly 
identical to `GroupPandasUDFSerializer`, differing only in whether batches are 
processed lazily (iterator mode) or all at once (regular mode).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   All existing tests pass without modification:
   - Iterator mode tests (11 tests): `test_apply_in_pandas_iterator_*`
   - Regular mode tests (39 tests): all other `ApplyInPandasTests`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Co-Generated-by: Cursor with Claude 4.5 Sonnet
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

