[
https://issues.apache.org/jira/browse/SPARK-55183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-55183:
-----------------------------------
Labels: pull-request-available (was: )
> Extract assign_cols_by_name transformer from ArrowStreamGroupUDFSerializer
> --------------------------------------------------------------------------
>
> Key: SPARK-55183
> URL: https://issues.apache.org/jira/browse/SPARK-55183
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> Problem:
> ArrowStreamGroupUDFSerializer.dump_stream contains inline logic to reorder
> RecordBatch columns to match the expected schema order when
> `assign_cols_by_name=True`. This pattern mixes data transformation with
> serialization logic.
> Current code (ArrowStreamGroupUDFSerializer.dump_stream):
> ```python
> if self._assign_cols_by_name:
>     batch_iter = (
>         (
>             pa.RecordBatch.from_arrays(
>                 [batch.column(field.name) for field in arrow_type],
>                 names=[field.name for field in arrow_type],
>             ),
>             arrow_type,
>         )
>         for batch, arrow_type in batch_iter
>     )
> ```
> Proposal:
> Extract this column reordering transformation into ArrowBatchTransformer as a
> reusable pure function.