[ https://issues.apache.org/jira/browse/ARROW-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923566#comment-16923566 ]
Wes McKinney commented on ARROW-6417:
-------------------------------------

OK, it appears that the jemalloc version is causing the perf difference.

Current master branch with vendored jemalloc version (4.something with patches):

{code}
$ python 20190903_parquet_benchmark.py dense-random 100000
({'case': 'read-dense-random-single-thread'}, 0.6065331888198853)
{code}

master with jemalloc 5.2.0:

{code}
$ python 20190903_parquet_benchmark.py dense-random 100000
({'case': 'read-dense-random-single-thread'}, 1.2143790817260742)
{code}

To reproduce these results yourself:

* Get the old jemalloc tarball from here https://github.com/apache/arrow/tree/maint-0.12.x/cpp/thirdparty/jemalloc
* Set {{$ARROW_JEMALLOC_URL}} to the path of that before building
* Use this branch which has the old EP configuration https://github.com/wesm/arrow/tree/use-old-jemalloc

Here's the benchmark script that I'm running above:

https://gist.github.com/wesm/7e5ae1d41981cfdd20415faf71e5f57e

I'm interested if other benchmarks are affected or if this is a peculiarity of this particular benchmark.

> [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have slowed down since 0.11.x
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6417
>                 URL: https://issues.apache.org/jira/browse/ARROW-6417
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 20190903_parquet_benchmark.py, 20190903_parquet_read_perf.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> In doing some benchmarking, I have found that binary reads seem to be slower
> from Arrow 0.11.1 to master branch. It would be a good idea to do some basic
> profiling to see where we might improve our memory allocation strategy (or
> whatever the bottleneck turns out to be)

--
This message was sent by Atlassian Jira
(v8.3.2#803003)