Wes McKinney created SPARK-13534:
------------------------------------

             Summary: Implement Apache Arrow serializer for Spark DataFrame for 
use in DataFrame.toPandas
                 Key: SPARK-13534
                 URL: https://issues.apache.org/jira/browse/SPARK-13534
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
            Reporter: Wes McKinney


The current code path for accessing Spark DataFrame data in Python using 
PySpark passes through an inefficient serialization-deserialization process 
that I've examined at a high level here: 
https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects 
are deserialized in pure Python as a list of tuples, which are then converted 
to a pandas.DataFrame using its {{from_records}} alternate constructor. This 
process also uses a large amount of memory.
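To illustrate the path described above, here is a minimal sketch of the 
conversion step (the sample rows and column names are hypothetical stand-ins 
for deserialized Row tuples):

```python
import pandas as pd

# Stand-in for RDD[Row] data deserialized in pure Python as a list of
# tuples; column names here are illustrative only.
rows = [(1, "a", 1.5), (2, "b", None)]

# The alternate constructor used today: builds the DataFrame row by row,
# which is slow and memory-hungry for large results.
df = pd.DataFrame.from_records(rows, columns=["id", "label", "value"])
```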

For flat (no nested types) schemas, the Apache Arrow memory layout 
(https://github.com/apache/arrow/tree/master/format) can be deserialized to 
{{pandas.DataFrame}} objects with small overhead compared with memcpy / system 
memory bandwidth -- Arrow's null bitmasks must be examined, and the null 
values in the data have to be replaced with pandas's sentinel values (None or 
NaN as appropriate).

I will be contributing patches to Arrow in the coming weeks for converting 
between Arrow and pandas in the general case, so if Spark can send Arrow 
memory to PySpark, we will hopefully be able to increase the Python data 
access throughput by an order of magnitude or more. I propose to add a new 
serializer for Spark DataFrame and a method that can be invoked from PySpark 
to request an Arrow memory-layout byte stream, prefixed by a data header 
indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
