Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21546

### Performance Tests - toPandas

Tests were run on a 4-node standalone cluster with 32 cores total (Ubuntu 14.04.1, OpenJDK 8). I measured the wall-clock time to execute `toPandas()` and took the average best time of 5 runs / 5 loops each.

Test code

```python
import time

from pyspark.sql.functions import rand

df = spark.range(1 << 25, numPartitions=32).toDF("id") \
    .withColumn("x1", rand()).withColumn("x2", rand()) \
    .withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start
```

| Current Master | This PR  |
|----------------|----------|
| 5.803557       | 4.342533 |
| 5.409119       | 4.399408 |
| 5.493509       | 4.468471 |
| 5.433107       | 4.36524  |
| 5.488757       | 4.373791 |

| Avg Master | Avg This PR |
|------------|-------------|
| 5.5256098  | 4.3898886   |

#### Speedup of 1.258712989
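As a quick sanity check, the averages and the speedup figure above can be recomputed directly from the per-run timings in the table:

```python
# Per-run wall-clock times (seconds) copied from the results table.
master = [5.803557, 5.409119, 5.493509, 5.433107, 5.488757]
this_pr = [4.342533, 4.399408, 4.468471, 4.36524, 4.373791]

avg_master = sum(master) / len(master)     # average for current master
avg_this_pr = sum(this_pr) / len(this_pr)  # average for this PR

# Speedup = master average / PR average
speedup = avg_master / avg_this_pr
print(avg_master, avg_this_pr, speedup)
```

This reproduces the reported averages (5.5256098 and 4.3898886) and the ~1.2587x speedup.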