[jira] [Updated] (SPARK-55529) Optimize applyInPandas by restoring Arrow-level batch merge for non-iterator UDF

Yicong Huang (Jira) Tue, 17 Feb 2026 10:58:26 -0800


     [ https://issues.apache.org/jira/browse/SPARK-55529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yicong Huang updated SPARK-55529:
---------------------------------
    Description: 
After SPARK-54316 consolidated GroupPandasIterUDFSerializer into 
GroupPandasUDFSerializer, the non-iterator applyInPandas lost its efficient 
Arrow-level batch merge. SPARK-55459 partially fixed the 3x regression by 
optimizing the pandas concatenation strategy, but a ~1.5-2.5x regression 
remains compared to the pre-54316 baseline.

Root cause: The current code converts each Arrow batch to pandas individually, 
then reassembles via pd.concat. The original code merged all Arrow batches into 
one pa.Table via pa.Table.from_batches() (near zero-copy), then converted to 
pandas once.

Proposed fix:
- GroupPandasUDFSerializer.load_stream yields raw Iterator[pa.RecordBatch] 
instead of converting per-batch
- Split mapper: non-iterator UDF collects all batches and merges at Arrow 
level; iterator UDF converts per-batch lazily
- Simplify wrap_grouped_map_pandas_udf to receive flat list[pd.Series] 
(pre-merged)

  was:
After SPARK-54316 consolidated GroupPandasIterUDFSerializer into 
GroupPandasUDFSerializer, the non-iterator applyInPandas lost its efficient 
Arrow-level batch merge. SPARK-55459 partially fixed the 3x regression by 
optimizing the pandas concatenation strategy, but a ~1.5-2.5x regression 
remains compared to the pre-54316 baseline.

Root cause: The current code converts each Arrow batch to pandas individually, 
then reassembles via pd.concat. The original code merged all Arrow batches into 
one pa.Table via pa.Table.from_batches() (near zero-copy), then converted to 
pandas once.

Proposed fix:
- GroupPandasUDFSerializer.load_stream yields raw Iterator[pa.RecordBatch] 
instead of converting per-batch
- Split mapper: non-iterator UDF collects all batches and merges at Arrow 
level; iterator UDF converts per-batch lazily
- Simplify wrap_grouped_map_pandas_udf to receive flat list[pd.Series] 
(pre-merged)

Microbenchmark (Arrow-to-pandas hot path, large groups with few columns):
||Version||100K rows, 5 cols||1M rows, 5 cols||vs Nov baseline||
|Nov baseline (pre-54316)|1.19 ms|7.91 ms| — |
|Post-54316|2.59 ms|30.15 ms|2.2-3.8x slower|
|Post-55459 (current master)|1.81 ms|20.33 ms|1.5-2.6x slower|
|This PR|0.30 ms|1.38 ms|4.0-5.8x faster|


> Optimize applyInPandas by restoring Arrow-level batch merge for non-iterator 
> UDF
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-55529
>                 URL: https://issues.apache.org/jira/browse/SPARK-55529
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> After SPARK-54316 consolidated GroupPandasIterUDFSerializer into 
> GroupPandasUDFSerializer, the non-iterator applyInPandas lost its efficient 
> Arrow-level batch merge. SPARK-55459 partially fixed the 3x regression by 
> optimizing the pandas concatenation strategy, but a ~1.5-2.5x regression 
> remains compared to the pre-54316 baseline.
> Root cause: The current code converts each Arrow batch to pandas 
> individually, then reassembles via pd.concat. The original code merged all 
> Arrow batches into one pa.Table via pa.Table.from_batches() (near zero-copy), 
> then converted to pandas once.
> Proposed fix:
> - GroupPandasUDFSerializer.load_stream yields raw Iterator[pa.RecordBatch] 
> instead of converting per-batch
> - Split mapper: non-iterator UDF collects all batches and merges at Arrow 
> level; iterator UDF converts per-batch lazily
> - Simplify wrap_grouped_map_pandas_udf to receive flat list[pd.Series] 
> (pre-merged)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

