Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
_UPDATE_
I changed `toPandas` to write out-of-order partitions to Python as they
arrive, followed by a list of indices giving the correct batch order.
In Python, the batches are then reordered before building the Arrow
Table / Pandas DataFrame. This is slightly more complicated than before
because we send extra info to Python, but it significantly reduces the
upper-bound space complexity in the driver JVM from holding all the data to
holding only the largest partition. It also seems to be a little faster, so I
re-ran the performance tests, which I'll post now.
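The reordering step on the Python side can be sketched as follows. This is an illustrative standalone example, not the actual PR code: the function name and the use of plain strings in place of Arrow record batches are assumptions for demonstration.

```python
def reorder_batches(batches, batch_order):
    """Place each received batch at its correct position.

    batches[i] is the i-th batch in arrival order; batch_order[i] is
    that batch's index in the original partition order. (Names are
    illustrative, not taken from the Spark source.)
    """
    ordered = [None] * len(batches)
    for batch, idx in zip(batches, batch_order):
        ordered[idx] = batch
    return ordered

# Example: four partitions arrive in the order 2, 0, 3, 1
received = ["b2", "b0", "b3", "b1"]
order = [2, 0, 3, 1]
print(reorder_batches(received, order))  # ['b0', 'b1', 'b2', 'b3']
```

Because the JVM streams each partition as soon as it arrives instead of buffering everything, only the index list plus the in-flight partition must be held driver-side.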