[jira] [Updated] (SPARK-55388) Refactor eval types processing logic

Yicong Huang (Jira) Thu, 05 Feb 2026 22:44:07 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-55388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yicong Huang updated SPARK-55388:
---------------------------------
    Description: 
In the current Python worker UDF execution path, the logic for each 
{{PythonEvalType}} is spread across three layers:

- *Wrapper functions* (e.g. {{wrap_arrow_batch_iter_udf}}, 
{{wrap_scalar_arrow_udf}}, {{wrap_grouped_map_arrow_udf}}, etc.) that 
pre-process UDF arguments and post-process results.
- *Mapper blocks* in {{read_udfs}} that handle input column selection, UDF 
invocation, and output assembly — often shared across unrelated eval types.
- *Serializers* (e.g. {{ArrowStreamUDFSerializer}}) that mix data 
transformation (struct flatten/wrap, pandas conversion) with I/O concerns.

To understand the full data flow for a single eval type, one has to trace 
through all three layers and reconstruct the implicit contracts between them 
(e.g. the serializer yields {{[batch]}} lists, the mapper indexes into them, 
the wrapper attaches {{(result, type)}} tuples).
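
As a rough illustration of these implicit contracts, the sketch below models the three layers with plain Python lists. Every function name in it is hypothetical, and it is deliberately simplified: it is not the actual PySpark worker code, only a toy showing how each layer has to know what shape the neighbouring layer produces.

```python
# Hypothetical, simplified stand-ins; none of these are real PySpark functions.

def serializer_load_stream(batches):
    """Serializer layer (input): yields each batch wrapped in a one-element list."""
    for batch in batches:
        yield [batch]


def wrap_udf(udf, return_type):
    """Wrapper layer: attaches the declared return type to the UDF result."""
    def wrapped(batch):
        return udf(batch), return_type
    return wrapped


def mapper(wrapped_udf, batch_lists):
    """Mapper layer: has to know that the serializer yields [batch] lists."""
    for batch_list in batch_lists:
        yield wrapped_udf(batch_list[0])


def serializer_dump_stream(results):
    """Serializer layer (output): has to know the wrapper emits (result, type) tuples."""
    return [(result, return_type) for result, return_type in results]


def run_three_layer_pipeline(batches):
    """Wires the layers together the way a reader has to reconstruct them mentally."""
    double = wrap_udf(lambda batch: [x * 2 for x in batch], "int")
    return serializer_dump_stream(mapper(double, serializer_load_stream(batches)))
```

In this toy, changing the shape produced by any one function silently breaks another one, which is the kind of coupling described above.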

The goal is to make each eval type self-contained: all input transformation, 
UDF invocation, result verification, and output transformation should be 
co-located in one place in {{read_udfs}}, with serializers reduced to pure I/O.
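
A minimal sketch of what a self-contained eval type could look like, again using plain Python lists and hypothetical names ({{read_batches}}, {{process_scalar_eval_type}}) rather than the real {{read_udfs}} code. The length check stands in for result verification and is an assumption made for illustration.

```python
# Hypothetical sketch; names and the verification rule are illustrative assumptions.

def read_batches(batches):
    """Pure I/O stand-in: hands batches over unchanged, with no wrapping or conversion."""
    yield from batches


def process_scalar_eval_type(udf, batches):
    """All steps for one eval type live together in this single function."""
    for batch in read_batches(batches):
        # 1. Input transformation: materialize the batch as the UDF's argument.
        args = list(batch)
        # 2. UDF invocation.
        result = udf(args)
        # 3. Result verification: a scalar UDF must return one value per input row.
        if len(result) != len(args):
            raise ValueError(
                f"expected {len(args)} output rows, got {len(result)}"
            )
        # 4. Output transformation: emit a plain list for the I/O layer to write.
        yield list(result)
```

For example, {{list(process_scalar_eval_type(lambda b: [x * 2 for x in b], [[1, 2], [3]]))}} yields one doubled list per input batch, and a UDF that drops rows fails at the verification step instead of deeper in a serializer.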

This can be done incrementally, one eval type at a time, starting with the 
simpler Arrow-based ones.

> Refactor eval types processing logic
> ------------------------------------
>
>                 Key: SPARK-55388
>                 URL: https://issues.apache.org/jira/browse/SPARK-55388
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major


