[
https://issues.apache.org/jira/browse/SPARK-29040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-29040:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Support pyspark.createDataFrame from a pyarrow.Table
> ----------------------------------------------------
>
> Key: SPARK-29040
> URL: https://issues.apache.org/jira/browse/SPARK-29040
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.1.0
> Reporter: Bryan Cutler
> Priority: Major
>
> PySpark {{createDataFrame}} currently supports creating a spark DataFrame
> from Pandas, using Arrow if enabled. This could be extended to accept a
> {{pyarrow.Table}} which has the added benefit of being able to efficiently
> use columns with nested struct types.
> It is possible to convert a pyarrow.Table with nested columns into a
> pandas.DataFrame, but the data becomes dictionaries, and is not a performant
> way to parallelize the data.
> Time/Date columns would need to be handled specially, since pyspark currently
> uses pandas to convert Arrow data of these types to the required Spark
> internal format.
> This follows from a mailing list discussion at
> http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]