Yicong Huang created SPARK-55349:
------------------------------------

             Summary: Consolidate pandas-to-Arrow conversion utilities in serializers
                 Key: SPARK-55349
                 URL: https://issues.apache.org/jira/browse/SPARK-55349
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang



The pandas UDF serializers contain significant code duplication for converting 
pandas data to Arrow format. Multiple `_create_batch` and `_create_array` 
methods exist across different serializer classes with nearly identical logic:

{code:python}
# ArrowStreamPandasSerializer
def _create_batch(self, series):
    arrs = []
    for s, t in series:
        # ... conversion logic ...
    return pa.RecordBatch.from_arrays(arrs, ...)

# ArrowStreamPandasUDFSerializer
def _create_batch(self, series):
    # ... similar conversion logic ...

# ArrowStreamPandasUDTFSerializer
def _create_array(self, series, spark_type):
    # ... conversion logic ...
{code}

This duplication makes the code harder to maintain and increases the risk of 
inconsistent behavior across serializers.

Proposal: Extract the common conversion logic into a dedicated 
`PandasToArrowConversion` class in `conversion.py`:

{code:python}
class PandasToArrowConversion:
    @classmethod
    def dataframe_to_batch(cls, data, schema, ...) -> pa.RecordBatch: ...
    
    @classmethod
    def series_to_array(cls, series, spark_type, ...) -> pa.Array: ...
{code}

This reduces code duplication and provides a single, well-tested conversion 
path.



