Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21546#discussion_r204484728

    --- Diff: python/pyspark/serializers.py ---
    @@ -184,27 +184,67 @@ def loads(self, obj):
             raise NotImplementedError


    -class ArrowSerializer(FramedSerializer):
    +class BatchOrderSerializer(Serializer):
    --- End diff --

    Yeah, the performance gain from sending batches out of order was small, but the real reason for this change was to improve memory usage in the driver JVM. Before this, the worst case was buffering the entire dataset in the JVM; now nothing is buffered and partitions are sent to Python as soon as they arrive. I think that's a huge improvement that is worth the additional complexity. This method might even be applicable to a `collect()` in Python as well.
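To illustrate the idea being discussed (not the actual Spark implementation): if each batch is tagged with its partition index, the JVM can ship batches in whatever order they complete, and the Python side can restore the original partition order before handing results to the user. The function names below are hypothetical, a minimal sketch only:

```python
import random


def send_batches_out_of_order(batches):
    """Simulate the JVM side: tag each batch with its partition index
    and send batches as they finish (arrival order is nondeterministic,
    modeled here with a shuffle)."""
    indexed = list(enumerate(batches))
    random.shuffle(indexed)
    for index, batch in indexed:
        yield index, batch


def reassemble_in_order(stream, num_batches):
    """Simulate the Python side: place each arriving batch into its
    slot by partition index, restoring the original order."""
    slots = [None] * num_batches
    for index, batch in stream:
        slots[index] = batch
    return slots


batches = ["batch-%d" % i for i in range(5)]
stream = send_batches_out_of_order(batches)
assert reassemble_in_order(stream, len(batches)) == batches
```

The trade-off the comment describes: the JVM never has to hold completed partitions while waiting for earlier ones, at the cost of a small amount of reordering bookkeeping on the Python side.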