[ 
https://issues.apache.org/jira/browse/SPARK-55194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55194:
-----------------------------------
    Labels: pull-request-available  (was: )

> Remove GroupArrowUDFSerializer by moving flatten logic to mapper
> ----------------------------------------------------------------
>
>                 Key: SPARK-55194
>                 URL: https://issues.apache.org/jira/browse/SPARK-55194
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> {{GroupArrowUDFSerializer}} exists only to add a {{flatten_struct}} call in 
> {{load_stream}}, inheriting everything else from 
> {{ArrowStreamGroupUDFSerializer}}:
> {code:python}
> class GroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
>     def load_stream(self, stream):
>         for (batches,) in self._load_group_dataframes(stream, num_dfs=1):
>             batch_iter = map(ArrowBatchTransformer.flatten_struct, batches)
>             yield batch_iter
> {code}
> This creates an unnecessary inheritance layer. The flatten operation is a 
> data transformation that belongs closer to where it's used (the mapper), not 
> in the serializer.
> Proposal: Move {{flatten_struct}} to the mapper and delete 
> {{GroupArrowUDFSerializer}}.
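
A rough sketch of what the proposal could look like, not the actual change. Plain dicts stand in for Arrow record batches, and every name below (`GroupSerializer`, `make_mapper`, `total_of_x`, this `flatten_struct`) is a hypothetical stand-in rather than Spark code. The serializer yields each group's batches untouched and the mapper flattens them before calling the user function:

```python
# Illustrative stand-ins only: plain dicts play the role of Arrow record
# batches, and none of these names are taken from the Spark code base.


def flatten_struct(batch):
    """Promote the fields of the batch's single struct column to top level."""
    (struct_column,) = batch.values()
    return dict(struct_column)


class GroupSerializer:
    """Stand-in for the base serializer: yields each group's batches as-is."""

    def load_stream(self, stream):
        for group in stream:
            yield iter(group)


def make_mapper(udf):
    """Wrap a per-group function so that flattening happens in the mapper."""

    def mapper(batches):
        return udf(map(flatten_struct, batches))

    return mapper


def total_of_x(batches):
    # Hypothetical user function: sum column "x" across a group's batches.
    return sum(sum(batch["x"]) for batch in batches)


# Two groups; each batch wraps its columns in a single struct column "_0".
stream = [
    [{"_0": {"x": [1, 2], "y": ["a", "b"]}}, {"_0": {"x": [3], "y": ["c"]}}],
    [{"_0": {"x": [10], "y": ["d"]}}],
]

mapper = make_mapper(total_of_x)
results = [mapper(batches) for batches in GroupSerializer().load_stream(stream)]
print(results)  # [6, 10]
```

With the flatten step living in the mapper, the serializer in this sketch needs no subclass of its own, which is the simplification the issue asks for.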



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
