[
https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Li Jin updated SPARK-20791:
---------------------------
Issue Type: Sub-task (was: New Feature)
Parent: SPARK-22216
> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
> Key: SPARK-20791
> URL: https://issues.apache.org/jira/browse/SPARK-20791
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, SQL
> Affects Versions: 2.1.1
> Reporter: Bryan Cutler
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses
> `to_records` to convert the DataFrame to a list of records and then converts
> each record to a list. Following this, there are a number of calls to
> serialize and transfer this data to the JVM. This process is very
> inefficient and also discards all schema metadata, requiring another pass
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to
> Arrow data and directly transferred to the JVM to create the Spark DataFrame.
> The performance will be better and the Pandas schema will also be used so
> that the correct types will be used.
> Issues with the poor type inference have come up before, causing confusion
> and frustration with users because it is not clear why it fails or doesn't
> use the same type from Pandas. Fixing this with Apache Arrow will solve
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]