Takuya Ueshin reassigned SPARK-55529:
-------------------------------------
Assignee: Yicong Huang
> Optimize applyInPandas by restoring Arrow-level batch merge for non-iterator
> UDF
> --------------------------------------------------------------------------------
>
> Key: SPARK-55529
> URL: https://issues.apache.org/jira/browse/SPARK-55529
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Assignee: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> After SPARK-54316 consolidated GroupPandasIterUDFSerializer into
> GroupPandasUDFSerializer, the non-iterator applyInPandas lost its efficient
> Arrow-level batch merge. SPARK-55459 partially fixed the resulting 3x
> regression by optimizing the pandas concatenation strategy, but a ~1.5-2.5x
> regression remains compared to the pre-SPARK-54316 baseline.
>
> Root cause: The current code converts each Arrow batch to pandas
> individually, then reassembles the pieces via pd.concat. The original code
> merged all Arrow batches into one pa.Table via pa.Table.from_batches()
> (near zero-copy), then converted to pandas once.
>
> Proposed fix:
> - GroupPandasUDFSerializer.load_stream yields a raw Iterator[pa.RecordBatch]
> instead of converting each batch to pandas
> - Split the mapper: the non-iterator UDF collects all batches and merges
> them at the Arrow level; the iterator UDF converts each batch lazily
> - Simplify wrap_grouped_map_pandas_udf to receive a flat list[pd.Series]
> (pre-merged)