[
https://issues.apache.org/jira/browse/SPARK-55336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-55336:
-----------------------------------
Labels: pull-request-available (was: )
> Factor out ArrowStreamPandasSerializer._create_batch logic for createDataFrame
> ------------------------------------------------------------------------------
>
> Key: SPARK-55336
> URL: https://issues.apache.org/jira/browse/SPARK-55336
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> Currently, `_create_from_pandas_with_arrow` in `conversion.py` prepares
> batched series data and passes it to `ArrowStreamPandasSerializer`, which
> then calls `_create_batch` internally during serialization:
> {code:python}
> # conversion.py - _create_from_pandas_with_arrow
> batched_series = [
>     [
>         (series, spark_type)
>         for (_, series), spark_type in zip(pdf_slice.items(), spark_types)
>     ]
>     for pdf_slice in pdf_slices
> ]
> ser = ArrowStreamPandasSerializer(...)
> jiter = self._sc._serialize_to_jvm(
>     batched_series, ser, reader_func, create_iter_server
> )
> {code}
> {code:python}
> # serializers.py - ArrowStreamPandasSerializer._create_batch
> def _create_batch(self, series, *, prefers_large_types=False):
>     arrs = [
>         self._create_array(s, spark_type, prefers_large_types=prefers_large_types)
>         for s, spark_type in series
>     ]
>     return pa.RecordBatch.from_arrays(
>         arrs, ["_%d" % i for i in range(len(arrs))]
>     )
> {code}
> For better decoupling, the `_create_batch` / `_create_array` logic could be
> factored out into standalone functions that `createDataFrame` can use
> directly to create Arrow RecordBatches from pandas data.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]