[ 
https://issues.apache.org/jira/browse/SPARK-55194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-55194.
-----------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Issue resolved by pull request 53974
[https://github.com/apache/spark/pull/53974]

> Remove GroupArrowUDFSerializer by moving flatten logic to mapper
> ----------------------------------------------------------------
>
>                 Key: SPARK-55194
>                 URL: https://issues.apache.org/jira/browse/SPARK-55194
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Assignee: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> {{GroupArrowUDFSerializer}} exists only to add a {{flatten_struct}} call in 
> {{load_stream}}, inheriting everything else from 
> {{ArrowStreamGroupUDFSerializer}}:
> {code:python}
> class GroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
>     def load_stream(self, stream):
>         for (batches,) in self._load_group_dataframes(stream, num_dfs=1):
>             batch_iter = map(ArrowBatchTransformer.flatten_struct, batches)
>             yield batch_iter
> {code}
> This creates an unnecessary inheritance layer. The flatten operation is a 
> data transformation that belongs closer to where it's used (the mapper), not 
> in the serializer.
> Proposal: Move {{flatten_struct}} to the mapper and delete 
> {{GroupArrowUDFSerializer}}.
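
The proposal above can be sketched in plain Python as follows. This is an illustrative sketch only, not the code from the pull request: batches are modeled as plain dicts rather than Arrow record batches, and `flatten_struct`, `load_groups`, and `make_mapper` are hypothetical stand-ins for the Spark internals named in the description.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Stand-in for an Arrow record batch: column name -> column values.
# Assumption: real code would operate on pyarrow.RecordBatch objects.
Batch = Dict[str, Any]


def flatten_struct(batch: Batch) -> Batch:
    """Hypothetical stand-in for ArrowBatchTransformer.flatten_struct.

    Assumes the batch holds exactly one struct column (modeled as a nested
    dict) and returns that struct's fields as top-level columns.
    """
    (struct_column,) = batch.values()
    return dict(struct_column)


def load_groups(stream: Iterable[List[Batch]]) -> Iterator[Iterator[Batch]]:
    """Stand-in for the base serializer: yields raw batches per group.

    No flattening happens here, so no dedicated subclass is needed.
    """
    for batches in stream:
        yield iter(batches)


def make_mapper(udf: Callable[[Iterator[Batch]], Any]) -> Callable[[Iterator[Batch]], Any]:
    """Wrap a user function so that flattening happens in the mapper."""

    def mapper(batches: Iterator[Batch]) -> Any:
        # The flatten step now sits next to the code that consumes its output.
        return udf(map(flatten_struct, batches))

    return mapper


def count_rows(batches: Iterator[Batch]) -> int:
    """Example user function: count the rows in one group."""
    return sum(len(batch["id"]) for batch in batches)


# Two groups; each batch wraps its columns in a single struct column "_0".
stream = [
    [{"_0": {"id": [1, 2], "v": [0.5, 1.5]}}],
    [{"_0": {"id": [3], "v": [2.5]}}, {"_0": {"id": [4, 5], "v": [3.5, 4.5]}}],
]

mapper = make_mapper(count_rows)
results = [mapper(group) for group in load_groups(stream)]
```

With this arrangement the serializer only deserializes, and any mapper that needs flattened input applies the transformation itself, which is the separation the description argues for.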



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]