Yicong Huang created SPARK-55336:
------------------------------------

             Summary: Factor out ArrowStreamPandasSerializer._create_batch logic for createDataFrame
                 Key: SPARK-55336
                 URL: https://issues.apache.org/jira/browse/SPARK-55336
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Currently, `_create_from_pandas_with_arrow` in `conversion.py` prepares batched 
series data and passes it to `ArrowStreamPandasSerializer`, which then calls 
`_create_batch` internally during serialization:

{code:python}
# conversion.py - _create_from_pandas_with_arrow
batched_series = [
    [
        (series, spark_type)
        for (_, series), spark_type in zip(pdf_slice.items(), spark_types)
    ]
    for pdf_slice in pdf_slices
]

ser = ArrowStreamPandasSerializer(...)
jiter = self._sc._serialize_to_jvm(batched_series, ser, reader_func, create_iter_server)
{code}

{code:python}
# serializers.py - ArrowStreamPandasSerializer._create_batch
def _create_batch(self, series, *, prefers_large_types=False):
    arrs = [
        self._create_array(s, spark_type, prefers_large_types=prefers_large_types)
        for s, spark_type in series
    ]
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])
{code}

For better decoupling, the `_create_batch` / `_create_array` logic could be factored out into standalone functions that `createDataFrame` can use directly to create Arrow RecordBatches from pandas data, separating batch creation from the streaming serializer.
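One possible shape for such a standalone helper, as a sketch only: the name `create_arrow_batch` and its signature are assumptions for illustration, and it takes Arrow types directly where the real `_create_array` accepts Spark types and handles the conversion itself.

{code:python}
# Hypothetical standalone helper (name and signature are illustrative, not
# the actual refactoring): builds an Arrow RecordBatch from
# (pandas Series, Arrow type) pairs without going through
# ArrowStreamPandasSerializer.
import pandas as pd
import pyarrow as pa


def create_arrow_batch(series_list):
    # Convert each pandas Series to an Arrow array of the requested type.
    arrs = [pa.Array.from_pandas(s, type=arrow_type) for s, arrow_type in series_list]
    # Name the columns "_0", "_1", ... as _create_batch does today.
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])
{code}

A helper of this shape would let `createDataFrame` produce RecordBatches directly while the serializer keeps only the stream-writing responsibility.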

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
