[
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456990#comment-15456990
]
Frederick Reiss commented on SPARK-13534:
-----------------------------------------
[~wesmckinn], are you planning to work on this issue soon?
> Implement Apache Arrow serializer for Spark DataFrame for use in
> DataFrame.toPandas
> -----------------------------------------------------------------------------------
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using
> PySpark passes through an inefficient serialization-deserialiation process
> that I've examined at a high level here:
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row]
> objects are being deserialized in pure Python as a list of tuples, which are
> then converted to pandas.DataFrame using its {{from_records}} alternate
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to
> {{pandas.DataFrame}} objects with comparatively small overhead compared with
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined,
> replacing the corresponding null values with pandas's sentinel values (None
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting
> between Arrow and pandas in the general case, so if Spark can send Arrow
> memory to PySpark, we will hopefully be able to increase the Python data
> access throughput by an order of magnitude or more. I propose to add an new
> serializer for Spark DataFrame and a new method that can be invoked from
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data
> header indicating array buffer offsets and sizes.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]