Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21546#discussion_r204484728

    --- Diff: python/pyspark/serializers.py ---
    @@ -184,27 +184,67 @@ def loads(self, obj):
             raise NotImplementedError


    -class ArrowSerializer(FramedSerializer):
    +class BatchOrderSerializer(Serializer):
    --- End diff --

    Yeah, the performance gain from sending batches out of order was small, but the real reason for this change was to improve memory usage in the driver JVM. Before this, the worst case was buffering the entire dataset in the JVM; now nothing is buffered and partitions are sent to Python as soon as they arrive. I think that's a huge improvement that is worth the additional complexity. This method might even be applicable to a `collect()` in Python as well.
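To illustrate the idea being discussed (not the actual Spark implementation): if each batch is tagged with its partition index, the JVM can ship batches in whatever order they complete, and the Python side can restore the original partition order before handing results to the user. The function names below are hypothetical, a minimal sketch only:

```python
import random


def send_batches_out_of_order(batches):
    """Simulate the JVM side: tag each batch with its partition index
    and send batches as they finish (arrival order is nondeterministic,
    modeled here with a shuffle)."""
    indexed = list(enumerate(batches))
    random.shuffle(indexed)
    for index, batch in indexed:
        yield index, batch


def reassemble_in_order(stream, num_batches):
    """Simulate the Python side: place each arriving batch into its
    slot by partition index, restoring the original order."""
    slots = [None] * num_batches
    for index, batch in stream:
        slots[index] = batch
    return slots


batches = ["batch-%d" % i for i in range(5)]
stream = send_batches_out_of_order(batches)
assert reassemble_in_order(stream, len(batches)) == batches
```

The trade-off the comment describes: the JVM never has to hold completed partitions while waiting for earlier ones, at the cost of a small amount of reordering bookkeeping on the Python side.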