Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21546
  
    ### Performance Tests - toPandas
    
    Tests ran on a 4-node standalone cluster with 32 cores total, running 
Ubuntu 14.04.1 and OpenJDK 8.
    I measured the wall-clock time to execute `toPandas()` and took the average 
of the best times from 5 runs of 5 loops each.
    
    Test code
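    ```python
    import time
    from pyspark.sql.functions import rand

    # Assumes an active SparkSession named `spark`.
    df = (spark.range(1 << 25, numPartitions=32).toDF("id")
          .withColumn("x1", rand()).withColumn("x2", rand())
          .withColumn("x3", rand()).withColumn("x4", rand()))
    for i in range(5):
        start = time.time()
        _ = df.toPandas()
        elapsed = time.time() - start
        print(elapsed)  # wall-clock seconds for this run
    ```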
    
    | Current Master (s) | This PR (s) |
    |--------------------|-------------|
    | 5.803557 | 4.342533 |
    | 5.409119 | 4.399408 |
    | 5.493509 | 4.468471 |
    | 5.433107 | 4.36524 |
    | 5.488757 | 4.373791 |
    
    | Avg Master (s) | Avg This PR (s) |
    |----------------|-----------------|
    | 5.5256098 | 4.3898886 |
    
    #### Speedup of 1.258712989
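
For anyone who wants to check the arithmetic, here is a minimal sketch of how the averages and the speedup follow from the per-run timings. The variable names are illustrative only and not part of the PR; the speedup is taken to be the ratio of the average master time to the average time with this PR.

```python
# Per-run wall-clock timings in seconds, copied from the results table.
master = [5.803557, 5.409119, 5.493509, 5.433107, 5.488757]
this_pr = [4.342533, 4.399408, 4.468471, 4.36524, 4.373791]

# Arithmetic mean of each column.
avg_master = sum(master) / len(master)
avg_this_pr = sum(this_pr) / len(this_pr)

# Speedup assumed to be (average master time) / (average PR time).
speedup = avg_master / avg_this_pr

print(f"avg master:  {avg_master:.7f}")
print(f"avg this PR: {avg_this_pr:.7f}")
print(f"speedup:     {speedup:.9f}")
```

A ratio above 1 means the PR branch finished `toPandas()` faster than master on this workload.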
    
    
    
    


