[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

viirya Tue, 26 Sep 2017 21:28:52 -0700

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19349#discussion_r141242240
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -251,6 +256,36 @@ def __repr__(self):
             return "ArrowPandasSerializer"
     
     
    +class ArrowStreamPandasSerializer(Serializer):
    +    """
    +    Serializes Pandas.Series as Arrow data with Arrow streaming format.
    +    """
    +
    +    def load_stream(self, stream):
    +        import pyarrow as pa
    +        reader = pa.open_stream(stream)
    +        for batch in reader:
    +            table = pa.Table.from_batches([batch])
    +            yield [c.to_pandas() for c in table.itercolumns()]
    +
    +    def dump_stream(self, iterator, stream):
    --- End diff --
    
    Maybe add few comments for `dump_stream` and `load_stream` like 
`ArrowPandasSerializer`.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

Reply via email to