[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

BryanCutler Tue, 24 Jan 2017 16:23:35 -0800

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/15821
  
    Here are some rough benchmarks, run locally on a machine with 16GB of
    memory and 8 cores using the default Spark config. The numbers are taken
    from 50 trials of calling `toPandas()` with and without Arrow enabled:
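
A minimal sketch of how such a trial loop could be timed, for readers who want to reproduce the shape of the measurement. This is not the linked benchmark script: the helper names `time_trials` and `summarize` are hypothetical, and the workload is passed in as a plain callable so the sketch runs without Spark.

```python
import statistics
import time


def time_trials(fn, trials=50):
    """Call fn() `trials` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summary statistics for a list of durations (std is the sample standard deviation)."""
    return {
        "count": len(durations),
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),
        "min": min(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; with Spark one would pass something like `df.toPandas`.
    stats = summarize(time_trials(lambda: sum(range(100_000)), trials=50))
    for key, value in stats.items():
        print(f"{key:5s} {value:.6f}")
```

With a real DataFrame, the same loop would presumably be run twice, once with Arrow enabled and once without, and the two summaries compared.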
    
    ## 1 million Longs
    Statistic | With Arrow | Without Arrow
    ----------|------------|--------------
    count     | 50.000000  | 50.000000
    mean      | 0.190573   | 2.576587
    std       | 0.078450   | 0.114455
    min       | 0.139911   | 2.259916
    25%       | 0.148212   | 2.516289
    50%       | 0.163769   | 2.555433
    75%       | 0.184402   | 2.631316
    max       | 0.518090   | 2.946415
    
    **13.52x speedup** on average
    
    ## 1 million Doubles
    Statistic | With Arrow | Without Arrow
    ----------|------------|--------------
    count     | 50.000000  | 50.000000
    mean      | 0.259145   | 2.090295
    std       | 0.069620   | 0.123091
    min       | 0.196666   | 1.998588
    25%       | 0.209051   | 2.015083
    50%       | 0.230751   | 2.032701
    75%       | 0.270519   | 2.122219
    max       | 0.439556   | 2.485232
    
    **8.07x speedup** on average
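
The average speedups quoted above appear to be the ratio of the two mean timings (without Arrow divided by with Arrow). A small sketch that recomputes them from the table values; the function name `speedup` and the dictionary layout are illustrative only.

```python
# Mean timings copied from the two tables above.
means = {
    "longs": {"with_arrow": 0.190573, "without_arrow": 2.576587},
    "doubles": {"with_arrow": 0.259145, "without_arrow": 2.090295},
}


def speedup(without_arrow, with_arrow):
    """Ratio of the mean time without Arrow to the mean time with Arrow."""
    return without_arrow / with_arrow


if __name__ == "__main__":
    for name, m in means.items():
        ratio = speedup(m["without_arrow"], m["with_arrow"])
        print(f"{name}: {ratio:.2f}x")
    # longs: 13.52x
    # doubles: 8.07x
```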
    
    The script used to generate these results can be found [here](https://issues.apache.org/jira/secure/attachment/12849193/benchmark.py).
