Tim Swast created SPARK-54337:
---------------------------------
Summary: Expose __dataframe__ interchange protocol on pyspark RDD,
SQL DataFrame, and pandas DataFrame APIs
Key: SPARK-54337
URL: https://issues.apache.org/jira/browse/SPARK-54337
Project: Spark
Issue Type: Improvement
Components: Input/Output
Affects Versions: 4.0.1
Reporter: Tim Swast
The `__dataframe__` interchange protocol
([https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html])
enables easy integration with many packages across the Python data ecosystem.
This is especially true for visualization packages such as matplotlib and
Microsoft's Data Wrangler
([https://github.com/microsoft/vscode-data-wrangler/issues/555#issuecomment-3215797533]).
I believe this API would be useful across all dataframe-like objects, including:
* RDD
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html]
* sql.DataFrame
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame]
* pandas.DataFrame (the pandas API on Spark, i.e. pyspark.pandas.DataFrame)
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html]
Implementation-wise, this would likely require downloading all data into
memory. For example, in BigQuery DataFrames, we expose this API by first
serializing to Arrow.
https://github.com/googleapis/python-bigquery-dataframes/blob/20ab469d29767a2f04fe02aa66797893ecd1c539/bigframes/core/interchange.py#L88
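
A rough sketch of what the Arrow-first approach could look like on the Spark side. This is not the BigQuery DataFrames code and not an existing PySpark API: the names `ArrowBackedInterchange` and `SparkInterchangeWrapper` are hypothetical. It assumes `pyspark.sql.DataFrame.toArrow()` (PySpark 4.0+) returns a `pyarrow.Table`, and that the table provides `__dataframe__(nan_as_null, allow_copy)` (pyarrow 11+).

```python
from typing import Any


class ArrowBackedInterchange:
    """Hypothetical mixin: implement __dataframe__ by materializing to Arrow."""

    def _to_arrow_table(self) -> Any:
        # Subclasses return an object that itself implements __dataframe__,
        # e.g. a pyarrow.Table (assumption: pyarrow >= 11).
        raise NotImplementedError

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True) -> Any:
        # Collecting distributed data to the driver is always a copy, so a
        # zero-copy request cannot be honored in this sketch.
        if not allow_copy:
            raise RuntimeError(
                "Exporting via __dataframe__ requires copying data into local memory"
            )
        table = self._to_arrow_table()
        # Delegate to the Arrow table's own interchange implementation.
        return table.__dataframe__(nan_as_null=nan_as_null, allow_copy=allow_copy)


class SparkInterchangeWrapper(ArrowBackedInterchange):
    """Hypothetical wrapper around a pyspark.sql.DataFrame."""

    def __init__(self, sdf: Any) -> None:
        self._sdf = sdf

    def _to_arrow_table(self) -> Any:
        # Assumption: sdf.toArrow() exists (PySpark 4.0+) and downloads all
        # rows to the driver as a single pyarrow.Table.
        return self._sdf.toArrow()
```

With pandas 1.5+ installed, a consumer could then call `pandas.api.interchange.from_dataframe(SparkInterchangeWrapper(sdf))`. A real implementation would presumably put `__dataframe__` on the classes themselves rather than on a wrapper, and would have to decide how to handle data that does not fit in driver memory.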