[ 
https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20791:
------------------------------------

    Assignee: Bryan Cutler

> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20791
>                 URL: https://issues.apache.org/jira/browse/SPARK-20791
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.1
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses 
> `to_records` to convert the DataFrame to a list of records and then converts 
> each record to a list.  Following this, there are a number of calls to 
> serialize and transfer this data to the JVM.  This process is very 
> inefficient and also discards all schema metadata, requiring another pass 
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to 
> Arrow data and directly transferred to the JVM to create the Spark DataFrame. 
>  The performance will be better and the Pandas schema will also be used so 
> that the correct types will be used.  
> Issues with the poor type inference have come up before, causing confusion 
> and frustration with users because it is not clear why it fails or doesn't 
> use the same type from Pandas.  Fixing this with Apache Arrow will solve 
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to