ianmcook opened a new pull request, #45481: URL: https://github.com/apache/spark/pull/45481
### What changes were proposed in this pull request?

This adds an experimental PySpark DataFrame method `_toArrow()`, which returns the contents of the DataFrame as a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).

### Why are the changes needed?

In the Apache Arrow community, we hear from many users who want to return the contents of a PySpark DataFrame as a PyArrow Table. Currently the only documented way to do this is to return the contents as a pandas DataFrame, then use PyArrow (`pa`) to convert that to a PyArrow Table:

```py
pa.Table.from_pandas(df.toPandas())
```

Going through pandas adds significant overhead that is easily avoided, since `toPandas()` already converts the contents of the Spark DataFrame to Arrow format internally when `spark.sql.execution.arrow.pyspark.enabled` is `true`.

It is also currently possible to use the experimental `_collect_as_arrow()` method to return the contents of a PySpark DataFrame as a list of PyArrow RecordBatches. This PR adds another experimental method, `_toArrow()`, which builds on that and returns the more user-friendly PyArrow Table object. It handles the case where the DataFrame has zero rows by returning a zero-row PyArrow Table with the schema included (whereas `_collect_as_arrow()` returns an empty Python list).

### Does this PR introduce _any_ user-facing change?

It adds an experimental method `_toArrow()` to the PySpark SQL DataFrame API. It does not introduce any other user-facing changes.

### How was this patch tested?

Tests are TBD.

### Was this patch authored or co-authored using generative AI tooling?

No
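For illustration, here is a minimal sketch of how the proposed method might be used next to the pandas round trip it replaces. The example data and variable names are hypothetical; `_toArrow()` and `toPandas()` are the methods discussed above.

```py
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Documented route today: Spark -> Arrow -> pandas -> Arrow
tbl_via_pandas = pa.Table.from_pandas(df.toPandas())

# Proposed route: Spark -> Arrow directly, skipping pandas
tbl = df._toArrow()  # returns a pyarrow.Table

assert tbl.num_rows == 2
```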

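The PR description does not include the implementation, but a minimal sketch of the approach it describes (building a Table from the batches that `_collect_as_arrow()` returns, and falling back to a schema-only Table when there are zero rows) might look like this. The standalone function name `to_arrow_table` is hypothetical, and `to_arrow_schema` is an internal PySpark helper used here on the assumption that it converts a Spark schema to a PyArrow one.

```py
import pyarrow as pa


def to_arrow_table(df):
    """Sketch: return the contents of a PySpark DataFrame as a pyarrow.Table."""
    # Internal PySpark helper (assumption): converts a StructType to a pyarrow.Schema
    from pyspark.sql.pandas.types import to_arrow_schema

    batches = df._collect_as_arrow()
    if batches:
        return pa.Table.from_batches(batches)
    # Zero-row case: _collect_as_arrow() returns an empty list, so build a
    # zero-row Table that still carries the DataFrame's schema.
    return pa.Table.from_batches([], schema=to_arrow_schema(df.schema))
```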