Yicong Huang created SPARK-55336:
------------------------------------
Summary: Factor out ArrowStreamPandasSerializer._create_batch
logic for createDataFrame
Key: SPARK-55336
URL: https://issues.apache.org/jira/browse/SPARK-55336
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Currently, `_create_from_pandas_with_arrow` in `conversion.py` prepares batched
series data and passes it to `ArrowStreamPandasSerializer`, which then calls
`_create_batch` internally during serialization:
{code:python}
# conversion.py - _create_from_pandas_with_arrow
batched_series = [
    [
        (series, spark_type)
        for (_, series), spark_type in zip(pdf_slice.items(), spark_types)
    ]
    for pdf_slice in pdf_slices
]
ser = ArrowStreamPandasSerializer(...)
jiter = self._sc._serialize_to_jvm(
    batched_series, ser, reader_func, create_iter_server
)
{code}
{code:python}
# serializers.py - ArrowStreamPandasSerializer._create_batch
def _create_batch(self, series, *, prefers_large_types=False):
    arrs = [
        self._create_array(s, spark_type, prefers_large_types=prefers_large_types)
        for s, spark_type in series
    ]
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])
{code}
For better decoupling, the `_create_batch` / `_create_array` logic could be
factored out into standalone functions that `createDataFrame` can use directly
to create Arrow RecordBatches from pandas data, so that DataFrame creation no
longer depends on the serializer's internal batching logic.
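A minimal sketch of what such a standalone helper could look like (the function name `create_batch_from_pandas` is illustrative only, not an existing Spark API; type coercion via `pa.Array.from_pandas` stands in for the serializer's `_create_array` handling):

{code:python}
# Hypothetical standalone helper: builds an Arrow RecordBatch directly
# from (pandas.Series, arrow_type) pairs, mirroring what
# ArrowStreamPandasSerializer._create_batch does internally.
import pandas as pd
import pyarrow as pa


def create_batch_from_pandas(series_with_types):
    """Convert [(pandas.Series, pyarrow.DataType), ...] into a RecordBatch."""
    arrs = [
        pa.Array.from_pandas(s, type=arrow_type)
        for s, arrow_type in series_with_types
    ]
    # Column names follow the serializer's "_0", "_1", ... convention.
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])


batch = create_batch_from_pandas(
    [
        (pd.Series([1, 2, 3]), pa.int64()),
        (pd.Series(["a", "b", "c"]), pa.string()),
    ]
)
{code}

With a helper like this, `_create_from_pandas_with_arrow` could build RecordBatches directly instead of routing pandas-specific batching through the serializer.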
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]