Yicong-Huang opened a new pull request, #53043:
URL: https://github.com/apache/spark/pull/53043
### What changes were proposed in this pull request?
This PR consolidates `GroupPandasIterUDFSerializer` with
`GroupPandasUDFSerializer` to eliminate code duplication and improve
maintainability.
**Key changes:**
**Modified `GroupPandasUDFSerializer`**
(`python/pyspark/sql/pandas/serializers.py`):
- Added `use_iterator` parameter to support both regular and iterator
modes
- Extracted common batch-to-pandas conversion logic into
`_convert_batches_to_pandas` helper method
- Unified `load_stream()` and `dump_stream()` methods to handle both
modes with minimal branching
- Both modes now use a single code path with conditional batch grouping
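
A minimal sketch of the shape this consolidation could take, not the actual PySpark code: one class, one shared conversion helper, and a single branch on `use_iterator`. The class name `GroupSerializerSketch`, the `convert` callable, and the use of plain lists in place of Arrow batches and pandas objects are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Union


class GroupSerializerSketch:
    """Hypothetical stand-in for a grouped serializer; not the PySpark class."""

    def __init__(self, convert: Callable[[list], list], use_iterator: bool = False):
        self._convert = convert            # stand-in for batch-to-pandas conversion
        self._use_iterator = use_iterator  # False: regular mode, True: iterator mode

    def _convert_batches(self, batches: Iterable[list]) -> Iterator[list]:
        # Shared helper: convert one batch at a time, lazily.
        for batch in batches:
            yield self._convert(batch)

    def load_stream(
        self, groups: Iterable[Iterable[list]]
    ) -> Iterator[Union[Iterator[list], List[list]]]:
        for batches in groups:
            converted = self._convert_batches(batches)
            if self._use_iterator:
                # Iterator mode: hand over the lazy generator for this group.
                yield converted
            else:
                # Regular mode: materialize every batch of the group up front.
                yield list(converted)


def times_ten(batch: list) -> list:
    # Toy conversion used only for this example.
    return [x * 10 for x in batch]


groups = [[[1, 2], [3]], [[4]]]  # two groups; the first has two batches
regular_out = list(GroupSerializerSketch(times_ten).load_stream(groups))
lazy_out = [list(g) for g in GroupSerializerSketch(times_ten, use_iterator=True).load_stream(groups)]
```

In this sketch the only mode-specific code is the final `if`/`else`; everything upstream of it is shared, which is the property the bullets above describe.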
### Why are the changes needed?
When the `Iterator[pandas.DataFrame]` API was added to
`groupBy().applyInPandas()` in SPARK-53614 (#52716), a new
`GroupPandasIterUDFSerializer` class was created. However, this class is nearly
identical to `GroupPandasUDFSerializer`, differing only in whether batches are
processed lazily (iterator mode) or all at once (regular mode).
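
To make the lazy versus all-at-once distinction concrete, here is a small sketch that uses plain Python generators in place of Arrow batches. The function names `read_batches`, `regular_mode`, and `iterator_mode` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Iterator, List


def read_batches(n: int, log: List[str]) -> Iterator[int]:
    # Simulated batch source; records each read so laziness is observable.
    for i in range(n):
        log.append(f"read {i}")
        yield i


def regular_mode(batches: Iterable[int]) -> List[int]:
    # All at once: every batch is read before the caller sees any of them.
    return list(batches)


def iterator_mode(batches: Iterable[int]) -> Iterator[int]:
    # Lazy: nothing is read until the caller asks for the next batch.
    return iter(batches)


eager_log: List[str] = []
eager = regular_mode(read_batches(3, eager_log))

lazy_log: List[str] = []
lazy = iterator_mode(read_batches(3, lazy_log))
reads_before_consumption = len(lazy_log)  # no batch has been read yet
first_batch = next(lazy)                  # triggers exactly one read
```

Both functions see the same input and differ only in when it is consumed, which is why a single serializer with a mode flag can plausibly cover both cases.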
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
All existing tests pass without modification:
- Iterator mode tests (11 tests): `test_apply_in_pandas_iterator_*`
- Regular mode tests (39 tests): all other `ApplyInPandasTests`
### Was this patch authored or co-authored using generative AI tooling?
Co-Generated-by: Cursor with Claude 4.5 Sonnet