GitHub user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19349#discussion_r141242240
--- Diff: python/pyspark/serializers.py ---
@@ -251,6 +256,36 @@ def __repr__(self):
         return "ArrowPandasSerializer"
+class ArrowStreamPandasSerializer(Serializer):
+    """
+    Serializes Pandas.Series as Arrow data with Arrow streaming format.
+    """
+
+    def load_stream(self, stream):
+        import pyarrow as pa
+        reader = pa.open_stream(stream)
+        for batch in reader:
+            table = pa.Table.from_batches([batch])
+            yield [c.to_pandas() for c in table.itercolumns()]
+
+    def dump_stream(self, iterator, stream):
--- End diff --
Maybe add a few comments for `dump_stream` and `load_stream`, like the ones
in `ArrowPandasSerializer`.
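
For illustration, here is a rough sketch of what the documented methods could look like. It is not the code from this PR: the class name `DocumentedArrowStreamSerializer` and the placeholder column names are hypothetical, the class does not extend `Serializer` so that it runs on its own, the body of `dump_stream` is a guess because the diff above is cut off at its signature, and it uses `pa.ipc.open_stream` / `pa.ipc.new_stream` rather than the `pa.open_stream` call shown in the diff.

```python
class DocumentedArrowStreamSerializer(object):
    """
    Hypothetical stand-in for ArrowStreamPandasSerializer: converts lists of
    pandas.Series to and from the Arrow streaming format.
    """

    def dump_stream(self, iterator, stream):
        """
        Write each list of pandas.Series from `iterator` to `stream` as one
        Arrow record batch. The writer is created lazily because the schema
        is only known once the first batch has been built.
        """
        import pyarrow as pa
        writer = None
        try:
            for series_list in iterator:
                arrays = [pa.Array.from_pandas(s) for s in series_list]
                # Column names are placeholders (assumption): "_0", "_1", ...
                names = ["_%d" % i for i in range(len(arrays))]
                batch = pa.RecordBatch.from_arrays(arrays, names)
                if writer is None:
                    writer = pa.ipc.new_stream(stream, batch.schema)
                writer.write_batch(batch)
        finally:
            # Closing the writer emits the end-of-stream marker.
            if writer is not None:
                writer.close()

    def load_stream(self, stream):
        """
        Read Arrow record batches from `stream` and yield, for each batch,
        a list with one pandas.Series per column.
        """
        import pyarrow as pa
        reader = pa.ipc.open_stream(stream)
        for batch in reader:
            table = pa.Table.from_batches([batch])
            yield [c.to_pandas() for c in table.itercolumns()]
```

The docstrings state what each method consumes and produces (an iterator of lists of series going in, one list of series per record batch coming out), which is the information a reader cannot easily infer from the bodies alone.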