[
https://issues.apache.org/jira/browse/SPARK-55388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yicong Huang updated SPARK-55388:
---------------------------------
Description:
In the current Python worker UDF execution path, the logic for each
{{PythonEvalType}} is spread across three layers:
- *Wrapper functions* (e.g. {{wrap_arrow_batch_iter_udf}},
{{wrap_scalar_arrow_udf}}, {{wrap_grouped_map_arrow_udf}}, etc.) that
pre-process UDF arguments and post-process results.
- *Mapper blocks* in {{read_udfs}} that handle input column selection, UDF
invocation, and output assembly — often shared across unrelated eval types.
- *Serializers* (e.g. {{ArrowStreamUDFSerializer}}) that mix data
transformation (struct flatten/wrap, pandas conversion) with I/O concerns.
To understand the full data flow for a single eval type, one has to trace
through all three layers and reconstruct the implicit contracts between them
(e.g. the serializer yields {{[batch]}} lists, the mapper indexes into them,
the wrapper attaches {{(result, type)}} tuples).
The goal is to make each eval type self-contained: all input transformation,
UDF invocation, result verification, and output transformation should be
co-located in one place in {{read_udfs}}, with serializers reduced to pure I/O.
This can be done incrementally, one eval type at a time, starting with the
simpler Arrow-based ones.
> Refactor eval types processing logic
> ------------------------------------
>
> Key: SPARK-55388
> URL: https://issues.apache.org/jira/browse/SPARK-55388
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
>
> In the current Python worker UDF execution path, the logic for each
> {{PythonEvalType}} is spread across three layers:
> - *Wrapper functions* (e.g. {{wrap_arrow_batch_iter_udf}},
> {{wrap_scalar_arrow_udf}}, {{wrap_grouped_map_arrow_udf}}, etc.) that
> pre-process UDF arguments and post-process results.
> - *Mapper blocks* in {{read_udfs}} that handle input column selection, UDF
> invocation, and output assembly — often shared across unrelated eval types.
> - *Serializers* (e.g. {{ArrowStreamUDFSerializer}}) that mix data
> transformation (struct flatten/wrap, pandas conversion) with I/O concerns.
> To understand the full data flow for a single eval type, one has to trace
> through all three layers and reconstruct the implicit contracts between them
> (e.g. the serializer yields {{[batch]}} lists, the mapper indexes into them,
> the wrapper attaches {{(result, type)}} tuples).
> The goal is to make each eval type self-contained: all input transformation,
> UDF invocation, result verification, and output transformation should be
> co-located in one place in {{read_udfs}}, with serializers reduced to pure
> I/O.
> This can be done incrementally, one eval type at a time, starting with the
> simpler Arrow-based ones.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]