GitHub user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
### Memory Improvements
**toPandas()**
The most significant improvement is the reduction of the upper-bound space
complexity in the JVM driver. Before, the entire dataset was collected in the
JVM before being sent to Python. With this change, as soon as a partition is
collected, the result handler immediately sends it to Python, so the upper
bound is the size of the largest partition. Also, using the Arrow stream
format is more efficient because the schema is written once per stream,
followed by the record batches. The schema is now sent only once from the
driver JVM to Python. Before, multiple Arrow file-format payloads were used,
each containing the schema; these duplicated schemas were created in the
executors, sent to the driver JVM, and then on to Python, where all but the
first one received were discarded.
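To illustrate the schema-once property, here is a minimal standalone pyarrow
sketch (not Spark's internal code; the batch contents are made up) comparing
one stream carrying two batches against two separate file-format payloads:
```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1.0, 2.0, 3.0])], names=["x1"])

# Stream format: the schema is written once, then both record batches.
stream_sink = pa.BufferOutputStream()
with pa.ipc.new_stream(stream_sink, batch.schema) as writer:
    writer.write_batch(batch)
    writer.write_batch(batch)

# Two standalone file-format payloads: each one repeats the schema
# (and adds a footer), which is the duplication removed here.
file_total = 0
for _ in range(2):
    file_sink = pa.BufferOutputStream()
    with pa.ipc.new_file(file_sink, batch.schema) as writer:
        writer.write_batch(batch)
    file_total += file_sink.getvalue().size

print(stream_sink.getvalue().size, file_total)  # stream payload is smaller
```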
I verified the upper-bound limit by running a test that collects more data
than the driver JVM memory available, using these settings on a standalone
cluster:
```
spark.driver.memory 1g
spark.executor.memory 5g
spark.sql.execution.arrow.enabled true
spark.sql.execution.arrow.fallback.enabled false
spark.sql.execution.arrow.maxRecordsPerBatch 0
spark.driver.maxResultSize 2g
```
Test code:
```python
from pyspark.sql.functions import rand
df = (spark.range(1 << 25, numPartitions=32).toDF("id")
      .withColumn("x1", rand())
      .withColumn("x2", rand())
      .withColumn("x3", rand()))
df.toPandas()
```
This makes the total data size 33554432 × 8 × 4 = 1073741824 bytes (2^25 rows
× 4 double columns × 8 bytes each = 1 GiB), which exceeds the 1g of driver
memory.
With the current master, it fails with OOM but passes using this PR.
**createDataFrame()**
No significant change in memory usage, except that using the stream format
instead of separate file-format payloads avoids duplicating the schema,
similar to toPandas() above. Reading the stream and parallelizing the batches
does cause the record-batch message metadata to be copied, but its size is
insignificant.
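As a rough sanity check on that claim, a pyarrow sketch (again not Spark's
code; the sizes are illustrative) can compare a batch's raw data size to its
serialized stream size, the difference being roughly the copied metadata:
```python
import pyarrow as pa

# One batch of 2^20 int64 values, about 8 MB of actual data.
batch = pa.record_batch([pa.array(range(1 << 20), type=pa.int64())],
                        names=["id"])

# Serialize the batch into its own stream and measure the overhead of
# the schema plus record-batch message metadata over the raw buffers.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

overhead = sink.getvalue().size - batch.nbytes
print(overhead)  # typically a few hundred bytes against ~8 MB of data
```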