Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19459
Benchmarks for running in local mode: 16 GB memory, i7-4800MQ CPU @ 2.70GHz × 8 cores, using the default Spark configuration. The data is 10 columns of doubles with 100,000 rows.
Code:
```python
import pandas as pd
import numpy as np

# Baseline: Arrow optimization disabled
spark.conf.set("spark.sql.execution.arrow.enable", "false")
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))
%timeit spark.createDataFrame(pdf)

# Same conversion with the Arrow optimization enabled
spark.conf.set("spark.sql.execution.arrow.enable", "true")
%timeit spark.createDataFrame(pdf)
```
Without Arrow: 1 loop, best of 3: 7.21 s per loop
With Arrow: 10 loops, best of 3: 30.6 ms per loop
**Speedup of ~235x** (7.21 s / 30.6 ms ≈ 236)
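For environments without IPython's `%timeit`, a plain-Python variant of the same timing might look like the sketch below. The `local[*]` master, run count, and `best_time` helper are assumptions for illustration, and the Arrow path requires pyarrow to be installed. (The config key matches the one used here; later Spark releases renamed it to `spark.sql.execution.arrow.enabled`.)

```python
import time

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))

def best_time(n_runs=3):
    # Best wall-clock time over n_runs, mirroring %timeit's "best of" report
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        spark.createDataFrame(pdf)
        times.append(time.perf_counter() - start)
    return min(times)

for enabled in ("false", "true"):
    spark.conf.set("spark.sql.execution.arrow.enable", enabled)
    print("arrow.enable=%s: best of %d: %.4f s" % (enabled, 3, best_time()))
```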
Also tested creating up to 2 million rows with Arrow; the results scale linearly (see the scaling sketch below).
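A rough sketch of how the linear-scaling check could be reproduced; the specific row counts and single-run timing are illustrative assumptions, not the exact procedure used:

```python
import time

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Time createDataFrame with Arrow enabled at increasing row counts;
# roughly linear growth in elapsed time indicates linear scaling.
for rows in (100000, 500000, 1000000, 2000000):
    pdf = pd.DataFrame(np.random.rand(rows, 10), columns=list("abcdefghij"))
    start = time.perf_counter()
    spark.createDataFrame(pdf)
    print("%7d rows: %.3f s" % (rows, time.perf_counter() - start))
```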