[ https://issues.apache.org/jira/browse/SPARK-55349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takuya Ueshin resolved SPARK-55349.
-----------------------------------
Assignee: Yicong Huang
Resolution: Done
Issue resolved by pull request 54125
https://github.com/apache/spark/pull/54125
> Consolidate pandas-to-Arrow conversion utilities in serializers
> ---------------------------------------------------------------
>
> Key: SPARK-55349
> URL: https://issues.apache.org/jira/browse/SPARK-55349
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Assignee: Yicong Huang
> Priority: Major
>
> The pandas UDF serializers contain significant code duplication for
> converting pandas data to Arrow format. Multiple `_create_batch` and
> `_create_array` methods exist across different serializer classes with nearly
> identical logic:
> {code:python}
> # ArrowStreamPandasSerializer
> def _create_batch(self, series):
>     arrs = []
>     for s, t in series:
>         # ... conversion logic ...
>     return pa.RecordBatch.from_arrays(arrs, ...)
>
> # ArrowStreamPandasUDFSerializer
> def _create_batch(self, series):
>     # ... similar conversion logic ...
>
> # ArrowStreamPandasUDTFSerializer
> def _create_array(self, series, spark_type):
>     # ... conversion logic ...
> {code}
> This duplication makes the code harder to maintain and increases the risk of
> inconsistent behavior.
> Proposal: Extract the common conversion logic into a dedicated
> `PandasToArrowConversion` class in `conversion.py`:
> {code:python}
> class PandasToArrowConversion:
>     @classmethod
>     def dataframe_to_batch(cls, data, schema, ...) -> pa.RecordBatch: ...
>
>     @classmethod
>     def series_to_array(cls, series, spark_type, ...) -> pa.Array: ...
> {code}
> This reduces code duplication and provides a single, well-tested conversion
> path.