[jira] [Commented] (SPARK-20791) Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame

Bryan Cutler (JIRA) Wed, 17 May 2017 17:18:16 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014976#comment-16014976
 ]


Bryan Cutler commented on SPARK-20791:
--------------------------------------

I will work on this pending SPARK-13534 being merged.

> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20791
>                 URL: https://issues.apache.org/jira/browse/SPARK-20791
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SQL
>    Affects Versions: 2.1.1
>            Reporter: Bryan Cutler
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses 
> `to_records` to convert the DataFrame to a list of records and then converts 
> each record to a list.  Following this, there are a number of calls to 
> serialize and transfer this data to the JVM.  This process is very 
> inefficient and also discards all schema metadata, requiring another pass 
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to 
> Arrow data and directly transferred to the JVM to create the Spark DataFrame. 
>  The performance will be better and the Pandas schema will also be used so 
> that the correct types will be used.  
> Issues with the poor type inference have come up before, causing confusion 
> and frustration with users because it is not clear why it fails or doesn't 
> use the same type from Pandas.  Fixing this with Apache Arrow will solve 
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-20791) Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame

Reply via email to