[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...

BryanCutler Wed, 27 Jun 2018 14:27:14 -0700

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21546
  
    ### Performance Tests - createDataFrame
    
    Tests were run on a 4-node standalone cluster with 32 cores total, 
    14.04.1-Ubuntu and OpenJDK 8. I measured the wall-clock time to execute 
    `createDataFrame()` and get the first record, taking the average best 
    time over 5 runs of 5 loops each.
    
    Test code
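    ```python
    import gc
    import time
    
    import numpy as np
    import pandas as pd
    
    # `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell).
    
    def run():
        # 10M rows x 10 columns of random doubles; note that generating the
        # pandas DataFrame happens inside the timed region.
        pdf = pd.DataFrame(np.random.rand(10000000, 10))
        spark.createDataFrame(pdf).first()
    
    for i in range(6):
        start = time.time()
        run()
        elapsed = time.time() - start
        gc.collect()
        print("Run %d: %f" % (i, elapsed))
    ```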
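For anyone repeating the measurement, the timing loop can be pulled into a small reusable helper. This is only a sketch, not the harness used for the numbers in this comment: `time_runs` and `summarize` are hypothetical names, `time.perf_counter()` is swapped in for `time.time()`, and a trivial stand-in workload replaces the Spark call so the snippet runs without a cluster.

```python
import gc
import time


def time_runs(fn, runs=5):
    """Call fn() `runs` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
        gc.collect()  # collect garbage between runs, outside the timed region
    return timings


def summarize(timings):
    """Return (best, mean) for a list of timings in seconds."""
    return min(timings), sum(timings) / len(timings)


# Stand-in workload (an assumption for illustration); replace it with the
# real call, e.g. a function that runs spark.createDataFrame(pdf).first().
timings = time_runs(lambda: sum(range(100000)), runs=5)
best, mean = summarize(timings)
print("best %.6f s, mean %.6f s" % (best, mean))
```

Passing the workload in as a function makes it easy to build the pandas DataFrame once beforehand, so that only the conversion itself is timed.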
    
    | Current Master (s) | This PR (s) |
    |--------------------|-------------|
    | 6.234608 | 5.665641 |
    | 6.32144 | 5.3475 |
    | 6.527859 | 5.370803 |
    | 6.95089 | 5.479151 |
    | 6.235046 | 5.529167 |
    
    
    
    | Avg Master (s) | Avg This PR (s) |
    |----------------|-----------------|
    | 6.4539686 | 5.4784524 |
    
    #### Speedup of 1.178064192
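
The averages and the speedup follow directly from the five timings in the table above. A minimal sketch of that arithmetic (the variable names are illustrative, and the speedup is taken as average master time divided by average PR time):

```python
# Wall-clock timings in seconds, copied from the table above.
master = [6.234608, 6.32144, 6.527859, 6.95089, 6.235046]
this_pr = [5.665641, 5.3475, 5.370803, 5.479151, 5.529167]

avg_master = sum(master) / len(master)
avg_pr = sum(this_pr) / len(this_pr)

# Speedup is the ratio of the average master time to the average PR time.
speedup = avg_master / avg_pr

print("Avg master: %.7f" % avg_master)
print("Avg this PR: %.7f" % avg_pr)
print("Speedup: %.9f" % speedup)
```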



