Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
### Performance Tests - toPandas
Tests were run on a 4-node standalone cluster with 32 cores total, Ubuntu 14.04.1 and OpenJDK 8.
Measured the wall-clock time to execute `toPandas()` and took the average of the best
times from 5 runs of 5 loops each.
Test code
```python
# assumes an active SparkSession `spark`
import time

from pyspark.sql.functions import rand

df = spark.range(1 << 25, numPartitions=32).toDF("id") \
    .withColumn("x1", rand()).withColumn("x2", rand()) \
    .withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start  # wall-clock time for this run
```
| Current Master (s) | This PR (s) |
|--------------------|-------------|
| 5.803557 | 4.342533 |
| 5.409119 | 4.399408 |
| 5.493509 | 4.468471 |
| 5.433107 | 4.36524 |
| 5.488757 | 4.373791 |

| Avg Master (s) | Avg This PR (s) |
|----------------|-----------------|
| 5.5256098 | 4.3898886 |
#### Speedup of ~1.26x (1.258712989)
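For reference, a minimal sketch of how the averages and speedup above follow from the per-run timings (plain Python, not part of the benchmark itself):

```python
master_times = [5.803557, 5.409119, 5.493509, 5.433107, 5.488757]
pr_times = [4.342533, 4.399408, 4.468471, 4.36524, 4.373791]

# average wall-clock time in seconds for each branch
avg_master = sum(master_times) / len(master_times)  # 5.5256098
avg_pr = sum(pr_times) / len(pr_times)              # 4.3898886

# speedup of this PR relative to current master
print(avg_master / avg_pr)                          # ~1.2587
```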