ianmcook opened a new pull request, #45481:
URL: https://github.com/apache/spark/pull/45481

   ### What changes were proposed in this pull request?
   This adds an experimental PySpark DataFrame method `_toArrow()`. This 
returns the contents of the DataFrame as a [PyArrow 
Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).
   
   ### Why are the changes needed?
   In the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a PyArrow Table. Currently the 
only documented way to do this is to return the contents as a pandas DataFrame, 
then use PyArrow (`pa`) to convert that to a PyArrow Table.
   ```py
   pa.Table.from_pandas(df.toPandas())
   ```
   Going through pandas adds significant overhead, which is easily avoided because 
internally `toPandas()` already converts the contents of the Spark DataFrame to 
Arrow format when `spark.sql.execution.arrow.pyspark.enabled` is `true`.
   
   Currently it is also possible to use the experimental `_collect_as_arrow()` 
method to return the contents of a PySpark DataFrame as a list of PyArrow 
RecordBatches. This PR adds another experimental method `_toArrow()` which 
builds on that and returns the more user-friendly PyArrow Table object. It 
handles the case where the DataFrame has zero rows, returning a zero-row 
PyArrow Table with the schema included (whereas `_collect_as_arrow()` returns 
an empty Python list).
   
   ### Does this PR introduce _any_ user-facing change?
   It adds an experimental DataFrame method `_toArrow()` to the PySpark SQL 
DataFrame API. It does not introduce any other user-facing changes.
   
   ### How was this patch tested?
   Tests are TBD
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

