[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Bryan Cutler (JIRA) Wed, 12 Jul 2017 10:55:50 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084390#comment-16084390
 ]


Bryan Cutler commented on SPARK-13534:
--------------------------------------

Hi [~tagar], the {{ArrowSerializer}} doesn't quite fit as a drop-in replacement 
because the standard PySpark serializers use iterators over elements and Arrow 
works on batches.  Trying to iterate over the batches to get individual 
elements would probably cancel out any performance gains.  So then you would 
need to operate on the data with an interface like Pandas.  I proposed 
something similar in my comment 
[here|https://issues.apache.org/jira/browse/SPARK-21190?focusedCommentId=16077390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16077390]
 (some api 
[details|https://gist.github.com/BryanCutler/2d2ae04e81fa96ba4b61dc095726419f]).
  I'd like to hear what your use case is for working with Arrow data and what 
you'd want to see in Spark to support this?

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13534
>                 URL: https://issues.apache.org/jira/browse/SPARK-13534
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Wes McKinney
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Reply via email to