Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
### Performance Tests - toPandas
Tests were run on a 4-node standalone cluster with 32 cores total, Ubuntu 14.04.1 and OpenJDK 8.
Measured the wall-clock time to execute `toPandas()` and took the average of the best
times from 5 runs of 5 loops each.
Test code
```python
# assumes an active SparkSession `spark`
import time

from pyspark.sql.functions import rand

df = spark.range(1 << 25, numPartitions=32).toDF("id") \
    .withColumn("x1", rand()).withColumn("x2", rand()) \
    .withColumn("x3", rand()).withColumn("x4", rand())

for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start  # wall-clock time for this run
```
| Current Master (s) | This PR (s) |
|--------------------|-------------|
| 5.803557 | 4.342533 |
| 5.409119 | 4.399408 |
| 5.493509 | 4.468471 |
| 5.433107 | 4.36524 |
| 5.488757 | 4.373791 |

| Avg Master (s) | Avg This PR (s) |
|----------------|-----------------|
| 5.5256098 | 4.3898886 |
#### Speedup of ~1.26x (1.258712989)
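For reference, a minimal sketch of how the averages and speedup above follow from the per-run timings (plain Python, not part of the benchmark itself):

```python
master_times = [5.803557, 5.409119, 5.493509, 5.433107, 5.488757]
pr_times = [4.342533, 4.399408, 4.468471, 4.36524, 4.373791]

# average wall-clock time in seconds for each branch
avg_master = sum(master_times) / len(master_times)  # 5.5256098
avg_pr = sum(pr_times) / len(pr_times)              # 4.3898886

# speedup of this PR relative to current master
print(avg_master / avg_pr)                          # ~1.2587
```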