[ https://issues.apache.org/jira/browse/ARROW-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451018#comment-17451018 ]

Michal Nowakiewicz commented on ARROW-14479:
--------------------------------------------

h1. Dimensions

Below is the list of useful dimensions for hash join performance testing.

We would like one test specifically targeting each of these dimensions. The 
description of each dimension may end with a note suggesting how to combine it 
with other related dimensions for multi-dimensional performance analysis. We 
can roughly assume that exploring a multi-dimensional space of up to 3 
dimensions means producing one graph per value of the lowest-cardinality 
dimension, with one curve per value of the second-lowest-cardinality dimension, 
and the highest-cardinality dimension on the x axis.
h2. 1. Size of the hash table

The size of the hash table can be measured in bytes or in the number of 
inserted elements (either rows or unique keys on the build side). Different 
implementations may organize data in hash tables and related data structures 
differently, and therefore produce data structures of different sizes for the 
same set of input rows. It seems reasonable to use the number of inserted rows 
as the benchmark's input parameter and the amount of memory used as a measured 
output.

Measuring performance for a varying hash table size is the fundamental test of 
the core hash table implementation. Going from tens to millions of inserted 
items, we can observe the impact of the various types of CPU cache misses on 
performance. The graph should show performance dropping at the points where 
each level of CPU cache becomes smaller than what the hash table data requires. 
The exact point at which this happens depends on the organization of data in 
the hash table and may vary from one implementation to another (some 
implementations may, for instance, use light-weight compression of hash table 
data to move these points further to the right on the graph). 

The extremes on both sides of this graph are of particular interest. Testing 
with hash tables fitting into the CPU L1 cache takes away the cost of memory 
access and exercises the implementation of hashing, hash table traversal, and 
fetching, transforming, and moving data from the hash table. Testing with a 
hash table filling a significant chunk of all available memory verifies the 
scenario where the main bottleneck by far is loading cache lines from main 
memory. In this case, techniques like using large/huge pages, memory 
prefetching, and partitioning before accessing hash tables play a significant 
role in achieving good performance, even though the same techniques would hurt 
performance on small hash tables. Another interesting data point is the 
throughput, in bytes per second, output by hash join in that scenario. What 
makes it interesting is that this number can often be similar to or even lower 
(at least for a single-threaded hash join) than SSD storage read/write 
bandwidth. This means that for the largest hash tables, an implementation that 
spills to disk storage may not be much worse than in-memory processing, and 
there may be a point (defined by the hash table size combined with the number 
of processing threads) at which the two performance lines meet and fully 
in-memory processing offers no benefit in execution time.
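The shape of the size-vs-throughput curve can be explored with a minimal sketch like the one below (Python, with a plain dict standing in for the real hash table; `probe_benchmark` is a name made up for this example). Absolute numbers are not comparable to the C++ implementation; only the trend across cache-size boundaries is illustrative.

```python
import random
import time

def probe_benchmark(num_keys, num_probes=100_000):
    """Build a hash table with num_keys entries, then time random probes.

    Returns (seconds_per_probe, match_count). A plain Python dict stands
    in for the real hash table implementation.
    """
    table = {k: k * 2 for k in range(num_keys)}               # build side
    probes = [random.randrange(num_keys) for _ in range(num_probes)]
    start = time.perf_counter()
    matches = sum(1 for k in probes if k in table)            # probe side
    elapsed = time.perf_counter() - start
    return elapsed / num_probes, matches

# Sweep table sizes across cache-size boundaries (L1 -> L2/L3 -> DRAM).
for n in (1_000, 100_000, 1_000_000):
    per_probe, matches = probe_benchmark(n)
```

Plotting `per_probe` against `n` on a log x axis should show the step-like drops described above.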
h2. 2. Data types

Another interesting dimension is data types together with the number of key 
columns. Hashing, storing keys in a hash table, and looking up keys in the 
hash table typically differ significantly between fixed-length data types 
(e.g. integer keys) and varying-length data types (e.g. string keys). There 
are also optimizations available uniquely for integer keys from a dense domain 
(a range of key values comparable to the number of distinct keys). 

Typically the number of key columns is small (single digit). There are some 
interesting performance questions related to the number of keys. An example 
would be comparing performance on 4x int64_t columns to performance on 1x 
32-byte binary column. Since there is an easy way to transform data from the 
first case to the second, a large difference in performance could signal a 
problem with the implementation for one of these cases. A similar scenario for 
strings could compare 4x strings of 0 to 16 characters each to 1x string of 0 
to 64 characters.
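The "easy transformation" between the two cases is just byte-wise concatenation of the fixed-length key values, which preserves equality. A minimal sketch (Python; `pack_keys` is a name invented for the example):

```python
import struct

def pack_keys(k0, k1, k2, k3):
    """Pack one row's values from four int64 key columns into a single
    32-byte binary key. Joining on the packed column must produce the
    same matches as joining on the four original columns."""
    return struct.pack("<4q", k0, k1, k2, k3)

row_a = (1, 2, 3, 4)
row_b = (1, 2, 3, 5)
assert len(pack_keys(*row_a)) == 32
# Equality of packed keys is equivalent to equality of the key tuples:
assert (pack_keys(*row_a) == pack_keys(*row_b)) == (row_a == row_b)
```

Because the two representations are equivalent for join purposes, a large performance gap between them points at the implementation rather than the workload.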

Combine with 1.
h2. 3. Selectivity

One of the functions of the hash join (for semi or inner join) is to remove 
rows on the probe side with no matches on the build side. Hash join may choose 
for instance to build a Bloom filter for keys on the build side to quickly 
eliminate such rows on the probe side, providing a fast membership test but 
with a small probability of false positives. It may also be interesting to 
compare the filtering performance of semi hash join to that of in-list filter, 
since both implement roughly the same functionality.

Filtering performance of hash join mostly depends on two factors:

a) number of distinct keys on build side (or number of inserted rows on build 
side for some implementations, which may be different in case of duplicate 
keys);

b) probability of accepting a key on probe side (selectivity).

The first factor affects the cost of a single filter query, while the second 
affects the relative portion of processing time spent doing post-filter tasks, 
like hash table lookups and materialization of matching rows from build side.

The join filtering performance test should include a comparison of semi-join 
to anti-join. A side-by-side view of the results for both join types at the 
same selectivity is interesting. On the one hand, it seems that anti-join 
could be implemented just the same way as semi-join, only with the filter 
results negated. On the other hand, there are some important differences 
between the two cases, e.g. semi-join can use Bloom filters (with false 
positives) for fast elimination of rows, while anti-join cannot (a Bloom 
filter would produce false negatives in that case). 
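The asymmetry between semi- and anti-join comes directly from the Bloom filter's one-sided error. A minimal sketch (Python; the `BloomFilter` class here is an illustrative toy, not Arrow's implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a single bit array."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(repr((i, key)).encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "little") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

build_keys = set(range(0, 1000, 2))   # even keys on the build side
bf = BloomFilter()
for k in build_keys:
    bf.add(k)

# No false negatives: every build key passes, so the filter is a safe
# pre-filter for semi-join.
assert all(bf.might_contain(k) for k in build_keys)
# False positives are possible: a passing probe key may still have no
# real match, which is why negating the filter result does not yield a
# correct anti-join.
```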

Combine with 1. and 2.
h2. 4. Join types

There are 8 basic types of join: left/right semi, left/right anti, inner, 
left/right outer and full outer. The processing pipeline for each of them is 
slightly different, with a different subset of required steps. Join filtering 
tests will cover the semi and anti variants, and the basic tests will be based 
on inner join, which leaves outer joins as the missing area for performance 
evaluation. Full outer join combines the unique steps of left and right outer 
joins, so testing left and right outer joins separately should be enough to 
get a good picture of the performance of the hash join steps unique to outer 
joins.
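The step unique to the outer variants is null-padding of unmatched rows instead of dropping them. A minimal sketch of the left outer pipeline (Python; `hash_left_outer_join` is a name made up for the example):

```python
def hash_left_outer_join(probe_rows, build_rows):
    """Sketch of a left outer hash join over (key, payload) tuples.

    Inner join would simply skip unmatched probe rows; the outer
    variant emits them with None padding, which is the extra step
    benchmarks for outer joins need to exercise.
    """
    table = {}
    for key, payload in build_rows:               # build phase
        table.setdefault(key, []).append(payload)
    out = []
    for key, payload in probe_rows:               # probe phase
        matches = table.get(key)
        if matches:
            out.extend((key, payload, m) for m in matches)
        else:
            out.append((key, payload, None))      # outer-join-only step
    return out

rows = hash_left_outer_join([(1, "a"), (2, "b")], [(1, "x")])
# rows == [(1, "a", "x"), (2, "b", None)]
```

Right outer additionally requires scanning the hash table for never-matched build rows, which is the other step a full outer join combines.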

Combine with 1. and 3.
h2. 5. Number of matches

The work required to process many-to-many joins may present unique challenges 
absent from many-to-one joins. For instance, the possibility of multiple 
matches for an input row on the probe side means that we may need to produce 
multiple copies of that row - one for each match in the hash table. We may 
also need to traverse chains of rows stored in the hash table for a single 
key in order to retrieve build-side payloads for all matching rows. 

Of particular interest is looking at the average number of matches in the 
context of inequality joins. Hash join can perform a join operation with a 
predicate that mixes key equalities with key inequalities. The simplest 
implementation uses a hash table for equalities and then traverses potentially 
many matches coming from it, evaluating inequalities for each of them. The 
usefulness of such an implementation depends on the cost of traversal of chains 
of matches for a given key and materialization of their data for the purpose of 
residual predicate evaluation. It may be acceptable for high selectivity 
residual predicates (high percentage of candidate matching rows pass the 
predicate) but not for low selectivity ones (most of the matching rows are 
false positives rejected by subsequent inequality tests).
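The simplest mixed equality/inequality implementation described above can be sketched as follows (Python; rows are `(eq_key, value)` pairs and `residual` stands for the inequality part of the predicate):

```python
def join_with_residual(probe_rows, build_rows, residual):
    """Equality hash join followed by residual predicate evaluation.

    The hash table handles the equality part; the chain of matches per
    key is then traversed and the residual predicate (e.g. an
    inequality) is evaluated on each candidate pair. Returns the output
    rows and the number of traversed candidates - with a low pass rate,
    most traversed candidates are wasted work.
    """
    table = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)
    out = []
    candidates = 0
    for key, value in probe_rows:
        for build_value in table.get(key, []):   # traverse match chain
            candidates += 1
            if residual(value, build_value):
                out.append((key, value, build_value))
    return out, candidates

# Join on key equality plus the predicate probe.value < build.value.
out, candidates = join_with_residual(
    [(1, 5)], [(1, 1), (1, 6), (1, 9)], lambda p, b: p < b)
# out == [(1, 5, 6), (1, 5, 9)], candidates == 3
```

The ratio `len(out) / candidates` is exactly the residual pass rate the paragraph above argues determines whether this implementation is acceptable.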

Combine with 1.
h2. 6. Payload size

Size of payload columns affects the cost of data movement when partitioning, 
copying or fetching rows from a hash table. It seems reasonable to look at 
varying payload sizes in the range of perhaps 0 to 100 bytes. 

Combine with 1. and 5.
h2. 7. Degree of parallelism

The probe side should scale very well with an increasing number of executing 
threads, since (almost) all data structures, including the hash table, are 
read-only during probe-side processing and do not require any synchronization 
between threads. Also, all threads share the same read-only copy of the data, 
which means they use the CPU cache efficiently during probe-side processing 
(without competing needs across the threads executing the hash join together). 

Build side is of particular interest with respect to degree of parallelism, 
since the hash table build process is not trivially parallelizable.
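One common way to parallelize the build without fine-grained synchronization is hash partitioning, sketched below (Python; the function names are invented for the example, and this is only one of several possible strategies):

```python
from concurrent.futures import ThreadPoolExecutor

def build_partition(rows):
    """Build one partition's hash table; no shared state, no locks."""
    table = {}
    for key, payload in rows:
        table.setdefault(key, []).append(payload)
    return table

def parallel_build(rows, num_partitions=4):
    """Partition build rows by hash(key), then build each partition's
    table independently. Because each key lands in exactly one
    partition, the per-partition tables can be built (and later probed)
    without any synchronization between threads."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in rows:
        partitions[hash(key) % num_partitions].append((key, payload))
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(build_partition, partitions))

def lookup(tables, key):
    """Probe routes each key to its partition's table."""
    return tables[hash(key) % len(tables)].get(key, [])

tables = parallel_build([(k, k * 10) for k in range(1000)])
```

The partitioning pass itself is sequential here; in a real engine it would also be parallel, and its cost is part of what a degree-of-parallelism benchmark should expose.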

Combine with: 1. and 2.
h2. 8. Dictionaries

The Arrow data format supports dictionary-encoded values (there is an array of 
values called a dictionary and another array storing only integer ids pointing 
into it). There is an opportunity for hash join to process data faster in the 
presence of dictionary-encoded key columns, since a dictionary already 
accomplishes some of the work that hash join needs to do when mapping 
duplicate keys to the same hash table entry. The most meaningful scenario for 
hash join involves normalized dictionaries - dictionaries with only unique 
values. It is interesting to compare hash join performance on the same key 
column with and without a normalized dictionary. There are 4 interesting 
cases: no dictionaries, a dictionary on the build side but not the probe side, 
a dictionary on the probe side but not the build side, and (different) 
dictionaries on both sides.
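A sketch of the dictionaries-on-both-sides case (Python; the function and parameter names are invented for the example, and normalized, unique-value dictionaries are assumed):

```python
def join_dictionary_encoded(probe_ids, probe_dict, build_ids, build_dict):
    """Inner hash join on dictionary-encoded keys, both sides encoded.

    Instead of hashing every probe value, hash each distinct build
    value once, then translate the probe dictionary into build-side
    match lists once. Per-row probing then becomes an array lookup by
    dictionary id: the hashing work is bounded by the dictionary sizes,
    not the number of rows. Returns matching (probe_row, build_row)
    index pairs.
    """
    # Build phase: one hash table entry per distinct build value.
    table = {}
    for row, did in enumerate(build_ids):
        table.setdefault(build_dict[did], []).append(row)
    # One-time translation of the probe dictionary.
    translated = [table.get(value, []) for value in probe_dict]
    # Probe phase: pure id indexing, no per-row hashing.
    return [(probe_row, build_row)
            for probe_row, did in enumerate(probe_ids)
            for build_row in translated[did]]

pairs = join_dictionary_encoded(
    probe_ids=[0, 1, 0], probe_dict=["a", "b"],
    build_ids=[1, 0], build_dict=["b", "a"])
# pairs == [(0, 0), (1, 1), (2, 0)]
```

The mixed cases (dictionary on only one side) still save the hashing work on the encoded side, which is what the 4-way comparison should quantify.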

Combine with 1. and 2.
h2. 9. [Optional] Distribution of key frequencies

Tests involving the dimensions above should be performed on key columns with a 
uniform distribution for the most meaningful results. That said, real-world 
data often does not follow a uniform distribution and is closer to some form 
of exponential distribution. Although not commonly implemented, it is possible 
to add to hash join performance optimizations specifically targeting 
probe-side keys with very different frequencies. Keeping the most frequent 
keys and their payloads together in memory should improve cache hit ratios for 
hash tables larger than the CPU cache.
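Skewed probe-side keys for such a test could be generated along these lines (Python sketch; `zipf_keys` and its parameters are invented for the example, and a Zipf-like power law stands in for "some form of exponential distribution"):

```python
import random
from collections import Counter

def zipf_keys(num_rows, num_distinct, s=1.2, seed=0):
    """Generate keys with a Zipf-like frequency distribution: the
    weight of key k is proportional to 1 / (k + 1)**s, so low-numbered
    keys are disproportionately frequent."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(num_distinct)]
    return rng.choices(range(num_distinct), weights=weights, k=num_rows)

keys = zipf_keys(100_000, 10_000)
counts = Counter(keys)
# The hottest few keys cover a disproportionate share of probe rows,
# which is why co-locating them (and their payloads) in memory pays off.
top10_share = sum(c for _, c in counts.most_common(10)) / len(keys)
```

A benchmark would sweep the skew parameter `s` (with `s` near 0 approaching the uniform baseline) and compare cache hit ratios against the uniform case.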

Combine with 1.
h1. Separating build side and probe side costs

In hash join processing, the build side and the probe side of the join are 
typically processed in separate phases: the join starts with a hash table 
build that involves only the build side, then switches to hash table lookups 
for all rows on the probe side, and finally, optionally, scans the hash table 
(outputting hash table rows with either any matches or no matches). Most 
dimensions of hash join performance testing mentioned in this document affect 
both sides of the join, but the challenges for the build side and the probe 
side can differ substantially. When testing, it is easy to vary the relative 
size (in number of rows) of the build side vs the probe side. Generating 10x 
more rows on the probe side will make almost all build-side processing 
overshadowed by the probe-side processing cost; symmetrically, testing with 
10x fewer rows on the probe side will mostly show only the build-side costs. 
There is no need to vary the relative sizes between these two extremes, since 
the costs in between should, with high probability, follow a linear 
combination of the costs at the extremes. 
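Phase separation makes the two costs easy to measure independently, as in this sketch (Python; `timed_hash_join` is a name made up for the example, with a dict standing in for the real hash table):

```python
import time

def timed_hash_join(build_rows, probe_rows):
    """Time the build and probe phases of an inner hash join separately,
    so that varying the build:probe row ratio shows which side's cost
    dominates."""
    t0 = time.perf_counter()
    table = {}
    for key, payload in build_rows:          # build phase
        table.setdefault(key, []).append(payload)
    t1 = time.perf_counter()
    matches = 0
    for key, _ in probe_rows:                # probe phase
        matches += len(table.get(key, ()))
    t2 = time.perf_counter()
    return {"build_s": t1 - t0, "probe_s": t2 - t1, "matches": matches}

build = [(k, None) for k in range(10_000)]
# Probe side 10x larger than build side: probe time should dominate.
stats = timed_hash_join(build, [(k % 10_000, None) for k in range(100_000)])
```

Running the same workload with the 10x ratio flipped gives the build-dominated extreme discussed above.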

> [C++][Compute] Hash Join microbenchmarks
> ----------------------------------------
>
>                 Key: ARROW-14479
>                 URL: https://issues.apache.org/jira/browse/ARROW-14479
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 7.0.0
>            Reporter: Michal Nowakiewicz
>            Assignee: Sasha Krassovsky
>            Priority: Major
>             Fix For: 7.0.0
>
>
> Implement a series of microbenchmarks giving a good picture of the 
> performance of hash join implemented in Arrow across different set of 
> dimensions.
> Compare the performance against some other product(s).
> Add scripts for generating useful visual reports giving a good picture of the 
> costs of hash join.
> Examples of dimensions to explore in microbenchmarks:
>  * number of duplicate keys on build side
>  * relative size of build side to probe side
>  * selectivity of the join
>  * number of key columns
>  * number of payload columns
>  * filtering performance for semi- and anti- joins
>  * dense integer key vs sparse integer key vs string key
>  * build size
>  * scaling of build, filtering, probe
>  * inner vs left outer, inner vs right outer
>  * left semi vs right semi, left anti vs right anti, left outer vs right outer
>  * non-uniform key distribution
>  * monotonic key values in input, partitioned key values in input (with and 
> without per batch min-max metadata)
>  * chain of multiple hash joins
>  * overhead of Bloom filter for non-selective Bloom filter



--
This message was sent by Atlassian Jira
(v8.20.1#820001)