Dandandan commented on pull request #68: URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-922238494
Here is the full table, thanks to @jorgecarleitao for implementing the parquet change so quickly.

(output in ms, average of 3 runs)

| Query | Arrow | Arrow2 |
|----------|:-------------:|------:|
| 1 | 2071.80 | 2267.46 |
| 3 | 1615.33 | 1817.35 |
| 5 | 2990.24 | 3222.48 |
| 6 | 883.02 | 1061.64 |
| 8 | OOM | OOM |
| 9 | OOM | OOM |
| 10 | 2784.24 | 3096.37 |
| 12 | 1473.96 | 2039.88 |
| 13 | 6207.32 | 5965.40 |

We can see arrow is almost always faster than the arrow2 branch (query 13 is the only exception).

Now, loaded into memory, it looks much more equal/mixed:

| Query | Arrow | Arrow2 |
|----------|:-------------:|------:|
| 1 | 1775.07 | 1640.60 |
| 3 | 1128.12 | 1109.23 |
| 5 | 2528.62 | 2413.35 |
| 6 | 215.58 | 211.19 |
| 10 | 1886.62 | 2259.54 |
| 12 | 739.42 | 820.00 |
| 13 | 6397.59 | 6326.59 |

It seems some portion of the difference could be that loading parquet is slower in arrow2, somehow. E.g. loading the full lineitem table (16 files, 2.4GB total) takes ~8380 ms in arrow2 vs ~6484 ms in the master branch using the `-m` option in the benchmarking tool.

@jorgecarleitao any idea of why this might happen? I could do some profiling if that's useful.
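To make the comparison easier to read, here is a small sketch that turns the timings above into arrow2/arrow ratios (above 1.0 means arrow2 is slower). The numbers are copied from the tables; the variable and function names are just illustrative and are not part of the benchmarking tool.

```python
# Timings in ms, copied from the tables above (average of 3 runs).
# Each value is (arrow_ms, arrow2_ms). Queries 8 and 9 are omitted (OOM on both).
parquet = {
    1: (2071.80, 2267.46),
    3: (1615.33, 1817.35),
    5: (2990.24, 3222.48),
    6: (883.02, 1061.64),
    10: (2784.24, 3096.37),
    12: (1473.96, 2039.88),
    13: (6207.32, 5965.40),
}
in_memory = {
    1: (1775.07, 1640.60),
    3: (1128.12, 1109.23),
    5: (2528.62, 2413.35),
    6: (215.58, 211.19),
    10: (1886.62, 2259.54),
    12: (739.42, 820.00),
    13: (6397.59, 6326.59),
}


def ratios(timings):
    """Return {query: arrow2_ms / arrow_ms}; values above 1.0 mean arrow2 is slower."""
    return {q: arrow2 / arrow for q, (arrow, arrow2) in timings.items()}


parquet_ratios = ratios(parquet)
memory_ratios = ratios(in_memory)

# Approximate lineitem load times quoted above (arrow2 vs master, with -m).
load_ratio = 8380 / 6484

for label, result in (("parquet", parquet_ratios), ("in-memory", memory_ratios)):
    slower = sorted(q for q, r in result.items() if r > 1.0)
    print(f"{label}: arrow2 slower on queries {slower}")
    for q, r in sorted(result.items()):
        print(f"  q{q}: {r:.3f}x")

print(f"lineitem load: {load_ratio:.3f}x")
```

Read this way, arrow2 is slower on every parquet query except 13, but only on queries 10 and 12 once the data is in memory, and the lineitem load alone is roughly 1.29x slower, which is consistent with the parquet reader being a good place to start profiling.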