[
https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014976#comment-16014976
]
Bryan Cutler commented on SPARK-20791:
--------------------------------------
I will work on this pending SPARK-13534 being merged.
> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
> Key: SPARK-20791
> URL: https://issues.apache.org/jira/browse/SPARK-20791
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SQL
> Affects Versions: 2.1.1
> Reporter: Bryan Cutler
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses
> `to_records` to convert the DataFrame to a list of records and then converts
> each record to a list. Following this, there are a number of calls to
> serialize and transfer this data to the JVM. This process is very
> inefficient and also discards all schema metadata, requiring another pass
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to
> Arrow data and directly transferred to the JVM to create the Spark DataFrame.
> The performance will be better and the Pandas schema will also be used so
> that the correct types will be used.
> Issues with the poor type inference have come up before, causing confusion
> and frustration with users because it is not clear why it fails or doesn't
> use the same type from Pandas. Fixing this with Apache Arrow will solve
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]