[ 
https://issues.apache.org/jira/browse/SPARK-55388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Huang updated SPARK-55388:
---------------------------------
    Summary: Refactor PythonEvalType processing logic (was: Refactor eval types processing logic)

> Refactor PythonEvalType processing logic
> ----------------------------------------
>
>                 Key: SPARK-55388
>                 URL: https://issues.apache.org/jira/browse/SPARK-55388
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>
> In the current Python worker UDF execution path, the logic for each 
> {{PythonEvalType}} is spread across three layers:
> - *Wrapper functions* (e.g. {{wrap_arrow_batch_iter_udf}}, 
> {{wrap_scalar_arrow_udf}}, {{wrap_grouped_map_arrow_udf}}, etc.) that 
> pre-process UDF arguments and post-process results.
> - *Mapper blocks* in {{read_udfs}} that handle input column selection, UDF 
> invocation, and output assembly, and that are often shared across unrelated eval types.
> - *Serializers* (e.g. {{ArrowStreamUDFSerializer}}) that mix data 
> transformation (struct flatten/wrap, pandas conversion) with I/O concerns.
> To understand the full data flow for a single eval type, one has to trace 
> through all three layers and reconstruct the implicit contracts between them 
> (e.g. the serializer yields {{[batch]}} lists, the mapper indexes into them, 
> the wrapper attaches {{(result, type)}} tuples).
> The goal is to make each eval type self-contained: all input transformation, 
> UDF invocation, result verification, and output transformation should be 
> co-located in one place in {{read_udfs}}, with serializers reduced to pure 
> I/O.
> This can be done incrementally, one eval type at a time, starting with the 
> simpler Arrow-based ones.
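
As a rough illustration of the goal described above, the following is a minimal sketch of one eval type handled end to end in a single function, with the reader reduced to pure I/O. It does not reflect the actual PySpark worker code: {{read_batches}}, {{eval_scalar_udf}}, and the list-of-columns {{Batch}} type are hypothetical names chosen for this example, and plain Python lists stand in for Arrow record batches.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical stand-in for an Arrow record batch: a list of equal-length columns.
Batch = List[List[int]]


def read_batches(stream: Iterable[Batch]) -> Iterator[Batch]:
    """Pure I/O (hypothetical): yield batches unchanged, with no data transformation."""
    for batch in stream:
        yield batch


def eval_scalar_udf(
    udf: Callable[..., List[int]],
    arg_offsets: Sequence[int],
    batches: Iterable[Batch],
) -> Iterator[Batch]:
    """Handle one eval type in one place (hypothetical sketch).

    Input transformation, UDF invocation, result verification, and output
    transformation are co-located instead of being split across a wrapper,
    a mapper, and a serializer.
    """
    for batch in batches:
        # Input transformation: select the columns this UDF takes as arguments.
        args = [batch[i] for i in arg_offsets]
        num_rows = len(batch[0]) if batch else 0

        # UDF invocation.
        result = udf(*args)

        # Result verification: the output must have one value per input row.
        if len(result) != num_rows:
            raise ValueError(
                f"UDF returned {len(result)} rows, expected {num_rows}"
            )

        # Output transformation: wrap the result as a single-column batch.
        yield [result]


def add_columns(a: List[int], b: List[int]) -> List[int]:
    """Example UDF: element-wise sum of two columns."""
    return [x + y for x, y in zip(a, b)]


# Usage: one input batch with three columns; the UDF reads columns 0 and 2.
input_stream = [[[1, 2], [9, 9], [10, 20]]]
output = list(eval_scalar_udf(add_columns, [0, 2], read_batches(input_stream)))
```

With this shape, the full data flow for the eval type can be read top to bottom in {{eval_scalar_udf}}, and there is no implicit contract between layers to reconstruct.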



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

