[
https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian Cook updated SPARK-47466:
-----------------------------
Description:
As a follow-up to SPARK-47365:
{{toArrow()}} is useful when the data is relatively small. For larger data, the
best way to return the contents of a PySpark DataFrame in Arrow format is to
return an iterator of [PyArrow
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].
was:
As a follow-up to SPARK-47365:
*toArrow()* is useful when the data is relatively small. For larger data, the
best way to return the contents of a PySpark DataFrame in Arrow format is to
return an iterator of [PyArrow
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].
> Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
> ------------------------------------------------------------------------
>
> Key: SPARK-47466
> URL: https://issues.apache.org/jira/browse/SPARK-47466
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.5.1
> Reporter: Ian Cook
> Priority: Major
>
> As a follow-up to SPARK-47365:
> {{toArrow()}} is useful when the data is relatively small. For larger data,
> the best way to return the contents of a PySpark DataFrame in Arrow format is
> to return an iterator of [PyArrow
> RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].