Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
_UPDATE_
I changed `toPandas` to write out-of-order partitions to Python as they
arrive, followed by a list of indices giving the correct batch order.
In Python, the batches are then reordered before building the Arrow
Table / Pandas DataFrame. This is slightly more complicated than before
because we send extra info to Python, but it significantly reduces the
upper-bound space complexity in the driver JVM from holding all the data to
holding only the largest partition. It also seems to be a little faster, so I
re-ran the performance tests, which I'll post now.
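The reordering step on the Python side can be sketched as follows. This is an illustrative standalone example, not the actual PR code: the function name and the use of plain strings in place of Arrow record batches are assumptions for demonstration.

```python
def reorder_batches(batches, batch_order):
    """Place each received batch at its correct position.

    batches[i] is the i-th batch in arrival order; batch_order[i] is
    that batch's index in the original partition order. (Names are
    illustrative, not taken from the Spark source.)
    """
    ordered = [None] * len(batches)
    for batch, idx in zip(batches, batch_order):
        ordered[idx] = batch
    return ordered

# Example: four partitions arrive in the order 2, 0, 3, 1
received = ["b2", "b0", "b3", "b1"]
order = [2, 0, 3, 1]
print(reorder_batches(received, order))  # ['b0', 'b1', 'b2', 'b3']
```

Because the JVM streams each partition as soon as it arrives instead of buffering everything, only the index list plus the in-flight partition must be held driver-side.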