holdenk commented on SPARK-13534:

And now they have a release :) I'm not certain its at the stage where we can 
use it - but I'll do some poking over the next few weeks :)

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> -----------------------------------------------------------------------------------
>                 Key: SPARK-13534
>                 URL: https://issues.apache.org/jira/browse/SPARK-13534
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Wes McKinney
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to