Wes McKinney created SPARK-13534:
------------------------------------
Summary: Implement Apache Arrow serializer for Spark DataFrame for
use in DataFrame.toPandas
Key: SPARK-13534
URL: https://issues.apache.org/jira/browse/SPARK-13534
Project: Spark
Issue Type: New Feature
Components: PySpark
Reporter: Wes McKinney
The current code path for accessing Spark DataFrame data in Python using
PySpark passes through an inefficient serialization-deserialization process
that I've examined at a high level here:
https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects
are deserialized in pure Python as a list of tuples, which is then
converted to a pandas.DataFrame using its {{from_records}} alternate
constructor. This also uses a large amount of memory.
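As a rough illustration of that current path (the row values here are made
up; in PySpark the tuples come from deserializing pickled Row data), each row
lands in Python as a plain tuple and pandas rebuilds columns from the
row-oriented list:

```python
import pandas as pd

# Illustrative stand-in for deserialized RDD[Row] data: one tuple per row.
rows = [(1, "a"), (2, "b"), (3, "c")]

# from_records transposes the row-oriented data into columnar arrays,
# touching every value in Python and holding both representations in memory.
df = pd.DataFrame.from_records(rows, columns=["x", "y"])
```

The per-value Python object handling in this transpose is the overhead the
Arrow-based path would avoid.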
For flat (no nested types) schemas, the Apache Arrow memory layout
(https://github.com/apache/arrow/tree/master/format) can be deserialized to
{{pandas.DataFrame}} objects with small overhead compared with
memcpy / system memory bandwidth -- Arrow's validity bitmasks must be
examined and the null values in the data replaced with pandas's sentinel
values (None or NaN as appropriate).
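A minimal sketch of that bitmask-to-sentinel step, using a hypothetical
helper and plain NumPy rather than the Arrow library itself (the function
name and calling convention are illustrative, not an existing API):

```python
import numpy as np

def arrow_to_pandas_column(values, validity_bitmap, length):
    """Illustrative conversion of an Arrow-style numeric column
    (values buffer + validity bitmap) into a float64 array with
    NaN sentinels, as pandas expects for nullable numeric data."""
    out = np.asarray(values[:length], dtype=np.float64).copy()
    for i in range(length):
        byte = validity_bitmap[i // 8]
        # Arrow validity bitmaps use least-significant-bit ordering:
        # bit i of the bitmap marks slot i as valid (1) or null (0).
        if not (byte >> (i % 8)) & 1:
            out[i] = np.nan
    return out
```

Aside from this bitmap scan and sentinel substitution, the values buffer can
be handed to pandas essentially as-is, which is where the near-memcpy cost
claim comes from.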
I will be contributing patches to Arrow in the coming weeks for converting
between Arrow and pandas in the general case, so if Spark can send Arrow
memory to PySpark, we will hopefully be able to increase Python data access
throughput by an order of magnitude or more. I propose to add a new
serializer for Spark DataFrame and a method that can be invoked from PySpark
to request an Arrow memory-layout byte stream, prefixed by a data header
indicating array buffer offsets and sizes.
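One hypothetical shape such a framed stream could take (the header layout
below is invented for illustration; the actual format would be defined by
the proposed serializer): a count of buffers, then an (offset, size) pair
per buffer, then the raw Arrow buffers concatenated.

```python
import struct

def write_stream(buffers):
    """Frame a list of raw byte buffers as: int32 buffer count,
    then (int64 offset, int64 size) per buffer, then the payload.
    This is an illustrative sketch, not the actual proposed format."""
    header = struct.pack("<i", len(buffers))
    offset = 0
    for buf in buffers:
        header += struct.pack("<qq", offset, len(buf))
        offset += len(buf)
    return header + b"".join(buffers)
```

With a header like this, the Python side can slice each array buffer out of
the received bytes by offset without copying or per-value deserialization.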
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]