Yicong Huang created SPARK-55349:
------------------------------------
Summary: Consolidate pandas-to-Arrow conversion utilities in serializers
Key: SPARK-55349
URL: https://issues.apache.org/jira/browse/SPARK-55349
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
The pandas UDF serializers contain significant code duplication for converting
pandas data to Arrow format. Multiple `_create_batch` and `_create_array`
methods exist across different serializer classes with nearly identical logic:
{code:python}
# ArrowStreamPandasSerializer
def _create_batch(self, series):
    arrs = []
    for s, t in series:
        # ... conversion logic ...
    return pa.RecordBatch.from_arrays(arrs, ...)

# ArrowStreamPandasUDFSerializer
def _create_batch(self, series):
    # ... similar conversion logic ...

# ArrowStreamPandasUDTFSerializer
def _create_array(self, series, spark_type):
    # ... conversion logic ...
{code}
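For context, the conversion each of these methods reimplements boils down to roughly the following (a simplified sketch only; the real serializers additionally handle timestamp localization, nested types, safe-cast options, and error reporting):

{code:python}
import pandas as pd
import pyarrow as pa


def _series_to_arrow(series: pd.Series, arrow_type: pa.DataType) -> pa.Array:
    # Mark missing values explicitly so Arrow stores them as nulls.
    return pa.Array.from_pandas(series, mask=series.isnull(), type=arrow_type)


# Example: one float column with a missing value becomes a single-column batch.
batch = pa.RecordBatch.from_arrays(
    [_series_to_arrow(pd.Series([1.0, 2.0, None]), pa.float64())],
    names=["c0"],
)
{code}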
This duplication makes the code harder to maintain and increases the risk of
inconsistent behavior.
Proposal: Extract the common conversion logic into a dedicated
`PandasToArrowConversion` class in `conversion.py`:
{code:python}
class PandasToArrowConversion:
    @classmethod
    def dataframe_to_batch(cls, data, schema, ...) -> pa.RecordBatch: ...

    @classmethod
    def series_to_array(cls, series, spark_type, ...) -> pa.Array: ...
{code}
This reduces code duplication and provides a single, well-tested conversion
path.
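For illustration, a minimal sketch of the consolidated helper and a serializer delegating to it. This is simplified: it takes pyarrow types directly where the proposal passes Spark types/schema, the column naming is a placeholder, and the method bodies are assumptions rather than the final implementation.

{code:python}
import pandas as pd
import pyarrow as pa


class PandasToArrowConversion:
    """Single pandas-to-Arrow conversion path shared by the pandas UDF serializers."""

    @classmethod
    def series_to_array(cls, series: pd.Series, arrow_type: pa.DataType) -> pa.Array:
        # Null handling (and, in the real version, timestamp/nested-type logic)
        # lives in exactly one place.
        return pa.Array.from_pandas(series, mask=series.isnull(), type=arrow_type)

    @classmethod
    def dataframe_to_batch(cls, columns, arrow_types) -> pa.RecordBatch:
        arrs = [cls.series_to_array(s, t) for s, t in zip(columns, arrow_types)]
        return pa.RecordBatch.from_arrays(arrs, names=[f"_{i}" for i in range(len(arrs))])


class ArrowStreamPandasSerializer:
    def _create_batch(self, series):
        # `series` is an iterable of (pandas.Series, pyarrow.DataType) pairs;
        # the serializer no longer carries its own copy of the conversion logic.
        cols, types = zip(*series)
        return PandasToArrowConversion.dataframe_to_batch(cols, types)
{code}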