[PR] [SPARK-56648][PYTHON][4.X] Refactor SQL_SCALAR_PANDAS_UDF [spark]

via GitHub Tue, 09 Jun 2026 14:09:40 -0700


Yicong-Huang opened a new pull request, #56415:
URL: https://github.com/apache/spark/pull/56415


   ### What changes were proposed in this pull request?
   
   Backport of https://github.com/apache/spark/pull/55613 (commit 1f5d0e32531) 
to `branch-4.x`. Clean cherry-pick, no code changes.
   
   Refactor `SQL_SCALAR_PANDAS_UDF` to use `ArrowStreamSerializer` as pure I/O, 
moving Arrow-to-Pandas and Pandas-to-Arrow conversion logic from 
`ArrowStreamPandasUDFSerializer` into `read_udfs()` in `worker.py`.
   
   Specifically:
   - Remove the dedicated `wrap_scalar_pandas_udf` wrapper.
   - Route `SQL_SCALAR_PANDAS_UDF` through 
`ArrowStreamSerializer(write_start_stream=True)`.
   - In `read_udfs()`, add a self-contained handler that:
     - Converts each Arrow `RecordBatch` to pandas Series via 
`ArrowBatchTransformer.to_pandas()` (with `struct_in_pandas="dict"`, 
`df_for_struct=True`, `ndarray_as_list=False`).
     - Invokes each UDF column-wise on the pandas inputs and validates the 
return type (must be array-like) and row count (must match input).
     - Enforces the existing rule that struct return types must be 
`pandas.DataFrame`.
     - Converts results back to an Arrow `RecordBatch` via 
`PandasToArrowConversion.convert()`.
   
   ### Why are the changes needed?
   
   Part of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388). 
Backporting to `branch-4.x` keeps the eval-type processing paths consistent 
with master and unblocks the backport of follow-up refactors (e.g. SPARK-56758) 
that build on this change.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change. See 
https://github.com/apache/spark/pull/55613 for test details and ASV benchmark 
results (latency within ±5% noise, peak memory flat).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No (clean cherry-pick of the original commit).
   
   This pull request and its description were written by Isaac.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


