Tim Swast created SPARK-54337:
---------------------------------

             Summary: Expose __dataframe__ interchange protocol on pyspark RDD, 
SQL DataFrame, and pandas DataFrame APIs
                 Key: SPARK-54337
                 URL: https://issues.apache.org/jira/browse/SPARK-54337
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 4.0.1
            Reporter: Tim Swast


The `__dataframe__` interchange protocol 
(https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html) 
enables easy integration with many packages across the Python data ecosystem. 
This is especially true for visualization packages such as matplotlib and 
Microsoft's Data Wrangler 
(https://github.com/microsoft/vscode-data-wrangler/issues/555#issuecomment-3215797533).

I believe this API would be useful across all dataframe-like objects, including:
 * RDD 
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html]
 * sql.DataFrame 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame]
 * pyspark.pandas.DataFrame (pandas API on Spark) 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html]

Implementation-wise, this would likely require downloading all of the data 
into memory. For example, in BigQuery DataFrames, we expose this API by first 
serializing to Arrow: 
https://github.com/googleapis/python-bigquery-dataframes/blob/20ab469d29767a2f04fe02aa66797893ecd1c539/bigframes/core/interchange.py#L88
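
As a rough illustration of the Arrow-based approach described above, here is a minimal sketch of how a wrapper might expose `__dataframe__` for a Spark SQL DataFrame. The `InterchangeAdapter` class is hypothetical and not part of PySpark. It assumes the wrapped object provides `toArrow()` (available on `pyspark.sql.DataFrame` since Spark 4.0) and that the returned `pyarrow.Table` implements `__dataframe__` (pyarrow 11 or later). An actual implementation inside PySpark could look quite different.

```python
class InterchangeAdapter:
    """Hypothetical wrapper, for illustration only; not a PySpark API."""

    def __init__(self, spark_df):
        # Assumed to be any object with a toArrow() method, such as a
        # pyspark.sql.DataFrame in Spark 4.0+.
        self._spark_df = spark_df

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # toArrow() collects the full result to the caller as a
        # pyarrow.Table, so the whole dataset must fit in memory.
        arrow_table = self._spark_df.toArrow()
        # Delegate to pyarrow's own implementation of the protocol.
        return arrow_table.__dataframe__(
            nan_as_null=nan_as_null, allow_copy=allow_copy
        )
```

A consumer that understands the protocol, for example `pandas.api.interchange.from_dataframe`, could then accept `InterchangeAdapter(df)` directly. Note that the data is materialized regardless of `allow_copy`; whether `allow_copy=False` should instead raise an error is a design question this sketch leaves open.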


