[ 
https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448298#comment-17448298
 ] 

Michal Nowakiewicz commented on ARROW-8990:
-------------------------------------------

One of the goals of the hash table implementation used inside the group by
exec node (and potentially, in the future, the hash join exec node) is to
"push down vectorization" into the hash table code. By that I mean that the
interface takes (small) vectors of inputs for lookups and inserts instead of
single inputs. This enables some specific optimizations. For instance, to use
memory prefetching the code needs to know future inputs in advance, which a
vector-at-a-time interface provides. Third-party hash tables typically process
one input at a time, so they are not an exact replacement for this.
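
To make the prefetching point concrete, here is a minimal sketch of a
vector-at-a-time lookup. It is not the code used in the exec nodes: the class
BatchHashTable, its method names, the linear-probing layout and the hash
function are all made up for illustration, and __builtin_prefetch is assumed
to be available (GCC or Clang).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(addr) __builtin_prefetch(addr)
#else
#define SKETCH_PREFETCH(addr) ((void)0)  // no-op on other compilers
#endif

// Hypothetical toy table: linear probing over 64-bit keys, power-of-two
// capacity, no deletion and no resizing.
class BatchHashTable {
 public:
  static constexpr uint32_t kNotFound = UINT32_MAX;

  explicit BatchHashTable(size_t capacity_pow2)
      : mask_(capacity_pow2 - 1), slots_(capacity_pow2) {
    assert(capacity_pow2 != 0 && (capacity_pow2 & mask_) == 0);
  }

  // Inserts a batch of keys; ids[i] receives the dense id assigned to keys[i].
  void InsertBatch(const uint64_t* keys, size_t n, uint32_t* ids) {
    for (size_t i = 0; i < n; ++i) {
      size_t pos = Hash(keys[i]) & mask_;
      while (slots_[pos].used && slots_[pos].key != keys[i]) {
        pos = (pos + 1) & mask_;
      }
      if (!slots_[pos].used) {
        assert(next_id_ < mask_);  // keep one slot empty so probing terminates
        slots_[pos].key = keys[i];
        slots_[pos].id = next_id_++;
        slots_[pos].used = true;
      }
      ids[i] = slots_[pos].id;
    }
  }

  // Looks up a batch of keys; ids[i] is kNotFound when keys[i] is absent.
  void FindBatch(const uint64_t* keys, size_t n, uint32_t* ids) const {
    std::vector<size_t> start(n);
    // Pass 1: the whole batch is known up front, so hash every key and
    // request the cache line of its home slot before any probing starts.
    for (size_t i = 0; i < n; ++i) {
      start[i] = Hash(keys[i]) & mask_;
      SKETCH_PREFETCH(&slots_[start[i]]);
    }
    // Pass 2: probe; the memory loads issued above can overlap with this work.
    for (size_t i = 0; i < n; ++i) {
      size_t pos = start[i];
      while (slots_[pos].used && slots_[pos].key != keys[i]) {
        pos = (pos + 1) & mask_;
      }
      ids[i] = kNotFound;
      if (slots_[pos].used) ids[i] = slots_[pos].id;
    }
  }

 private:
  struct Slot {
    uint64_t key = 0;
    uint32_t id = 0;
    bool used = false;
  };

  // Multiplicative hash, chosen only to keep the sketch short.
  static uint64_t Hash(uint64_t k) {
    return (k * 0x9E3779B97F4A7C15ull) >> 32;
  }

  size_t mask_;
  std::vector<Slot> slots_;
  uint32_t next_id_ = 0;
};
```

With a one-key-at-a-time interface the first pass has nothing to iterate over,
which is why the prefetch only becomes possible once the table sees a batch.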

That said, the interface of a third-party hash table could be adapted to what
we currently use in exec nodes, and it is valid to compare performance on a
micro-benchmark.
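
As a rough illustration of such an adaptation, the sketch below wraps a
one-key-at-a-time table behind a batch call. BatchAdapter and
FindOrInsertBatch are hypothetical names, not an existing Arrow API, and
std::unordered_map merely stands in for whichever third-party table is being
evaluated (any map with a compatible emplace and size should fit).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical adapter: presents a vector-at-a-time interface on top of a
// table that natively handles a single key per call.
template <typename Map = std::unordered_map<uint64_t, uint32_t>>
class BatchAdapter {
 public:
  // Maps each key to a dense group id, inserting keys not seen before.
  void FindOrInsertBatch(const uint64_t* keys, size_t n, uint32_t* ids) {
    assert(n == 0 || (keys != nullptr && ids != nullptr));
    for (size_t i = 0; i < n; ++i) {
      // emplace leaves the map unchanged when the key already exists, so the
      // candidate id (the current size) is only used for new keys.
      auto result = map_.emplace(keys[i], static_cast<uint32_t>(map_.size()));
      ids[i] = result.first->second;
    }
  }

  size_t size() const { return map_.size(); }

 private:
  Map map_;
};
```

An adapter like this keeps the benchmark harness identical for every table
under test, but it cannot recover batch-level optimizations such as
prefetching, since the wrapped table still sees one key per call.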

> [C++] Benchmark hash table against thirdparty options, possibly vendor a 
> thirdparty hash table library
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8990
>                 URL: https://issues.apache.org/jira/browse/ARROW-8990
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> While we have our own hash table implementation, it would be worthwhile to 
> set up some benchmarks so that we can compare against std::unordered_map and 
> some other thirdparty libraries for hash tables to know whether we should 
> possibly use a thirdparty library. See e.g.
> https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
> Libraries to consider: 
> * https://github.com/sparsehash/sparsehash



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
