[
https://issues.apache.org/jira/browse/ARROW-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-14479:
-----------------------------------
Labels: pull-request-available (was: )
> [C++][Compute] Hash Join microbenchmarks
> ----------------------------------------
>
> Key: ARROW-14479
> URL: https://issues.apache.org/jira/browse/ARROW-14479
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 7.0.0
> Reporter: Michal Nowakiewicz
> Assignee: Sasha Krassovsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Implement a series of microbenchmarks giving a good picture of the
> performance of the hash join implemented in Arrow across a range of
> dimensions.
> Compare the performance against some other product(s).
> Add scripts for generating useful visual reports giving a good picture of the
> costs of hash join.
> Examples of dimensions to explore in microbenchmarks:
> * number of duplicate keys on build side
> * relative size of build side to probe side
> * selectivity of the join
> * number of key columns
> * number of payload columns
> * filtering performance for semi- and anti-joins
> * dense integer key vs sparse integer key vs string key
> * build size
> * scaling of build, filtering, probe
> * inner vs left outer, inner vs right outer
> * left semi vs right semi, left anti vs right anti, left outer vs right outer
> * non-uniform key distribution
> * monotonic key values in input, partitioned key values in input (with and
> without per batch min-max metadata)
> * chain of multiple hash joins
> * overhead of a Bloom filter when the filter is non-selective
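
A few of the dimensions listed above (duplicate keys on the build side, build size relative to probe size, join selectivity) can be sketched as knobs on a single benchmark case. The code below is only an illustration under stated assumptions: it does not use Arrow's hash join or any Arrow API, it substitutes std::unordered_map for the join's hash table, it counts output rows instead of materializing them, and every name in it (JoinCase, MakeBuildKeys, MakeProbeKeys, RunInnerJoin) is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical knobs for one benchmark case; not part of Arrow.
struct JoinCase {
  std::size_t build_distinct_keys;  // distinct keys on the build side
  std::size_t build_dups_per_key;   // duplicates of each build-side key
  std::size_t probe_rows;           // rows on the probe side
  std::size_t probe_match_period;   // every Nth probe row matches (1 = all)
};

struct JoinResult {
  std::size_t output_rows;  // rows an inner join would produce
  double build_seconds;     // time spent building the hash table
  double probe_seconds;     // time spent probing it
};

// Dense integer keys 0..distinct-1, each repeated build_dups_per_key times.
std::vector<std::int64_t> MakeBuildKeys(const JoinCase& c) {
  std::vector<std::int64_t> keys;
  keys.reserve(c.build_distinct_keys * c.build_dups_per_key);
  for (std::size_t k = 0; k < c.build_distinct_keys; ++k) {
    for (std::size_t d = 0; d < c.build_dups_per_key; ++d) {
      keys.push_back(static_cast<std::int64_t>(k));
    }
  }
  return keys;
}

// Probe rows at positions divisible by probe_match_period get a key that
// exists on the build side; all other rows get a negative key, which never
// matches because build keys are non-negative.
std::vector<std::int64_t> MakeProbeKeys(const JoinCase& c) {
  assert(c.build_distinct_keys > 0 && c.probe_match_period > 0);
  std::vector<std::int64_t> keys;
  keys.reserve(c.probe_rows);
  for (std::size_t i = 0; i < c.probe_rows; ++i) {
    const bool match = (i % c.probe_match_period == 0);
    keys.push_back(match
                       ? static_cast<std::int64_t>(i % c.build_distinct_keys)
                       : -1 - static_cast<std::int64_t>(i));
  }
  return keys;
}

// Count-only inner join: the table maps key -> number of build rows, and
// each probe hit adds that count to the output row total.
JoinResult RunInnerJoin(const JoinCase& c) {
  using Clock = std::chrono::steady_clock;
  const std::vector<std::int64_t> build_keys = MakeBuildKeys(c);
  const std::vector<std::int64_t> probe_keys = MakeProbeKeys(c);

  const auto t0 = Clock::now();
  std::unordered_map<std::int64_t, std::size_t> table;
  table.reserve(c.build_distinct_keys);
  for (std::int64_t k : build_keys) ++table[k];
  const auto t1 = Clock::now();

  std::size_t out = 0;
  for (std::int64_t k : probe_keys) {
    const auto it = table.find(k);
    if (it != table.end()) out += it->second;
  }
  const auto t2 = Clock::now();

  return {out, std::chrono::duration<double>(t1 - t0).count(),
          std::chrono::duration<double>(t2 - t1).count()};
}
```

Sweeping one field of JoinCase while holding the others fixed gives one curve per dimension, for example build_dups_per_key = 1, 2, 4, 8 at constant probe_rows. The timings from a toy like this say nothing about Arrow itself; a real microbenchmark would drive the Arrow hash join through a benchmark harness and repeat each case enough times to get stable numbers.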