[ https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448298#comment-17448298 ]
Michal Nowakiewicz commented on ARROW-8990:
-------------------------------------------

One of the points of the hash table implementation used inside the group-by exec node (and in the future potentially also the hash join exec node) is to "push down vectorization" into the hash table code. What I mean by that is that the interface takes (small) vectors of inputs for lookups/inserts instead of single inputs. That allows for some specific optimizations: for instance, in order to use memory prefetching the code needs to know future inputs in advance, which a vector-at-a-time interface provides.

Third-party hash tables typically process one input at a time, so they are not an exact replacement for this. That said, the interface of a third-party hash table could be adapted to what we currently use in exec nodes, and it is valid to compare the performance on a micro-benchmark.

> [C++] Benchmark hash table against thirdparty options, possibly vendor a
> thirdparty hash table library
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8990
>                 URL: https://issues.apache.org/jira/browse/ARROW-8990
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> While we have our own hash table implementation, it would be worthwhile to
> set up some benchmarks so that we can compare against std::unordered_map and
> some other thirdparty libraries for hash tables to know whether we should
> possibly use a thirdparty library. See e.g.
> https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
> Libraries to consider:
> * https://github.com/sparsehash/sparsehash
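
To make the "vector at a time" point from the comment above concrete, here is a minimal sketch of how a batch lookup can use prefetching in a way a one-key-at-a-time interface cannot. This is not Arrow's actual hash table: the class name BatchHashSet, its methods, the hash function, and the linear-probing layout are all hypothetical and chosen only for brevity. The prefetch relies on the GCC/Clang builtin `__builtin_prefetch` and compiles to a no-op on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; this is not Arrow's hash table.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(addr) __builtin_prefetch(addr)
#else
#define SKETCH_PREFETCH(addr) ((void)0)
#endif

class BatchHashSet {
 public:
  // capacity must be a power of two and larger than the number of keys stored.
  explicit BatchHashSet(std::size_t capacity)
      : mask_(capacity - 1), slots_(capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
  }

  void Insert(std::uint64_t key) {
    std::size_t i = Hash(key) & mask_;
    while (slots_[i].used) {
      if (slots_[i].key == key) return;  // already present
      i = (i + 1) & mask_;               // linear probing
    }
    assert(size_ + 1 < slots_.size());   // always keep one empty slot
    slots_[i].key = key;
    slots_[i].used = true;
    ++size_;
  }

  // One-input-at-a-time lookup, the shape most third-party tables expose.
  bool FindOne(std::uint64_t key) const {
    return Probe(key, Hash(key) & mask_);
  }

  // Vector-at-a-time lookup: found[i] is set to 1 if keys[i] is present.
  void FindBatch(const std::uint64_t* keys, std::size_t n,
                 std::uint8_t* found) const {
    std::vector<std::size_t> start(n);
    // Pass 1: hash every key and request its first slot from memory early.
    // This is only possible because all inputs are known up front.
    for (std::size_t i = 0; i < n; ++i) {
      start[i] = Hash(keys[i]) & mask_;
      SKETCH_PREFETCH(&slots_[start[i]]);
    }
    // Pass 2: probe; the first slot of each key has already been requested.
    for (std::size_t i = 0; i < n; ++i) {
      found[i] = Probe(keys[i], start[i]) ? 1 : 0;
    }
  }

 private:
  struct Slot {
    std::uint64_t key = 0;
    bool used = false;
  };

  static std::size_t Hash(std::uint64_t key) {
    // Multiplicative hash; an arbitrary choice for this sketch.
    return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32);
  }

  bool Probe(std::uint64_t key, std::size_t i) const {
    while (slots_[i].used) {
      if (slots_[i].key == key) return true;
      i = (i + 1) & mask_;
    }
    return false;  // reached an empty slot, so the key is absent
  }

  std::size_t mask_;
  std::size_t size_ = 0;
  std::vector<Slot> slots_;
};
```

A one-at-a-time table could be wrapped to expose a FindBatch-style call by looping over FindOne, which is the kind of adaptation a micro-benchmark comparison would need, but that wrapper would not get the two-pass prefetching shown here.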