[GitHub] [arrow-datafusion] Dandandan commented on pull request #68: Experimenting with arrow2

GitBox Sat, 18 Sep 2021 01:13:02 -0700


Dandandan commented on pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-922238494



   Here is the full table, thanks to @jorgecarleitao for implementing the parquet change so quickly.
   
   (output in ms, average of 3 runs)
   | Query | Arrow   | Arrow2  |
   |-------|:-------:|--------:|
   | 1     | 2071.80 | 2267.46 |
   | 3     | 1615.33 | 1817.35 |
   | 5     | 2990.24 | 3222.48 |
   | 6     | 883.02  | 1061.64 |
   | 8     | OOM     | OOM     |
   | 9     | OOM     | OOM     |
   | 10    | 2784.24 | 3096.37 |
   | 12    | 1473.96 | 2039.88 |
   | 13    | 6207.32 | 5965.40 |
   
   We can see that arrow is faster than the arrow2 branch on every query except 13.
   With the data loaded into memory, the results look much more mixed:
   
   | Query | Arrow   | Arrow2  |
   |-------|:-------:|--------:|
   | 1     | 1775.07 | 1640.60 |
   | 3     | 1128.12 | 1109.23 |
   | 5     | 2528.62 | 2413.35 |
   | 6     | 215.58  | 211.19  |
   | 10    | 1886.62 | 2259.54 |
   | 12    | 739.42  | 820.00  |
   | 13    | 6397.59 | 6326.59 |
   
   It seems some portion of the difference could be that loading parquet is slower in arrow2, somehow.
   E.g. loading the full lineitem table (16 files, 2.4GB total) takes ~8380 ms in arrow2 vs ~6484 ms in the master branch, using the `-m` option in the benchmarking tool.
   @jorgecarleitao any idea why this might happen? I could do some profiling if that's useful.
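
For reference, the relative gaps can be worked out from the numbers above. This is only a minimal sketch: the timings are copied from the two tables and the lineitem measurement, while the helper name `relative_slowdown` and the dictionary names are made up for illustration and are not part of the benchmarking tool.

```python
def relative_slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """Fraction by which candidate is slower than baseline (negative = faster)."""
    return (candidate_ms - baseline_ms) / baseline_ms


# (arrow, arrow2) timings in ms, copied from the tables above.
# Queries 8 and 9 are omitted because both branches ran out of memory.
from_parquet = {
    1: (2071.80, 2267.46), 3: (1615.33, 1817.35), 5: (2990.24, 3222.48),
    6: (883.02, 1061.64), 10: (2784.24, 3096.37), 12: (1473.96, 2039.88),
    13: (6207.32, 5965.40),
}
in_memory = {
    1: (1775.07, 1640.60), 3: (1128.12, 1109.23), 5: (2528.62, 2413.35),
    6: (215.58, 211.19), 10: (1886.62, 2259.54), 12: (739.42, 820.00),
    13: (6397.59, 6326.59),
}

for label, table in (("parquet", from_parquet), ("memory", in_memory)):
    for query, (arrow_ms, arrow2_ms) in table.items():
        delta = relative_slowdown(arrow_ms, arrow2_ms)
        print(f"{label:8s} q{query:<3d} arrow2 vs arrow: {delta:+.1%}")

# Loading the full lineitem table: ~6484 ms on master vs ~8380 ms on arrow2.
lineitem = relative_slowdown(6484.0, 8380.0)
print(f"lineitem load, arrow2 vs master: {lineitem:+.1%}")
```

On the lineitem load this comes out to roughly 29% slower for arrow2, which is in the same range as the larger per-query gaps when reading from parquet.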


