[
https://issues.apache.org/jira/browse/ARROW-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451018#comment-17451018
]
Michal Nowakiewicz commented on ARROW-14479:
--------------------------------------------
h1. Dimensions
Below is the list of useful dimensions for hash join performance testing.
We would like to have one test specifically targeting each of these dimensions.
At the end of each dimension's description there may be a note suggesting how
to combine that dimension with other related dimensions for multi-dimensional
performance analysis. We can roughly assume that exploring a multi-dimensional
space with up to 3 dimensions means producing one graph for each value of the
lowest-cardinality dimension, with one curve for each value of the
second-lowest-cardinality dimension, using the highest-cardinality dimension as
the x axis.
h2. 1. Size of the hash table
Size of the hash table can be measured in bytes or in the number of inserted
elements (either rows or unique keys on the build side). Different
implementations may organize the hash table and related data structures
differently, and therefore produce data structures of different sizes for the
same set of input rows. It seems reasonable to use the number of inserted rows
as the input parameter for benchmarks and the amount of memory used as the
output of the measurement.
Measuring performance for varying size of the hash table is the fundamental
test of the core hash table implementation. Going from tens to millions of
inserted items in the hash table, we can observe the impact of various types of
CPU cache misses on the performance. The graph should show performance dropping
at the points where each type of CPU cache becomes smaller than what is
required for hash table data. The exact input value point where it happens
depends on the organization of data in the hash table and may vary from one
implementation to another (some implementations may for instance use
light-weight compression of data in the hash table to move these points further
to the right on the graph).
The extremes on both sides of this graph are of particular interest. Testing
with hash tables fitting into the CPU L1 cache takes away the cost of memory
access and exercises the core implementation: hashing, traversing the hash
table, and fetching, transforming and moving data from it. Testing with a hash
table filling a significant chunk of all available memory verifies the scenario
where the main bottleneck by far is loading a cache line from main memory. In
this case, techniques like using large/huge pages, memory prefetching, and
partitioning before accessing hash tables play a significant role in achieving
good performance, despite the fact that the same techniques would hurt
performance on small hash tables. Another interesting data point is the
throughput, in bytes per second, output by hash join in that scenario. What
makes it interesting is that this number can often be similar to, or even lower
than (at least for a single-threaded hash join), the read/write bandwidth of
SSD storage. This means that for the largest hash tables, an implementation
that spills to disk storage may not be much worse than in-memory processing,
and there may be a point where the two performance lines meet (a point, defined
by the hash table size combined with the number of processing threads, at which
fully in-memory processing brings no benefit in execution time).
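As a rough sketch of such a benchmark (illustrative Python, not the Arrow implementation; a real harness would use a compiled language and report memory used as an output), one can sweep the number of inserted keys and time probes, looking for throughput drops as the table outgrows each cache level:

```python
import random
import time

def probe_throughput(num_keys: int, num_probes: int = 100_000) -> float:
    """Build a hash table with num_keys entries, then time random probes.

    num_keys is the benchmark input parameter; in a real harness the
    memory used by the table would be the measured output.
    """
    rng = random.Random(42)
    table = {key: key for key in range(num_keys)}                  # build side
    probes = [rng.randrange(num_keys) for _ in range(num_probes)]  # probe side
    start = time.perf_counter()
    hits = sum(1 for key in probes if key in table)
    elapsed = time.perf_counter() - start
    assert hits == num_probes  # every probe key was inserted, so all match
    return num_probes / elapsed

# Sweeping num_keys from tens to millions should show throughput dropping
# each time the table outgrows another level of CPU cache.
for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} keys: {probe_throughput(n):>12,.0f} probes/s")
```

The exact drop-off points depend on the data organization of the hash table, as noted above; only the overall shape of the curve is expected to be stable across implementations.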
h2. 2. Data types
Another interesting dimension is data types and the number of key columns.
Hashing, storing keys in a hash table, and looking up keys in the hash table
are typically quite different between fixed-length data types (e.g. integer
keys) and varying-length data types (e.g. string keys). There are also
optimizations available uniquely for integer keys from a dense domain (a range
of key values comparable to the number of distinct keys).
Typically the number of key columns is small (single digit). There are some
interesting performance questions related to the number of keys. An example
would be comparing performance on 4x int64_t key columns to performance on a
single 32-byte binary key column. Since there is an easy way to transform data
from the first case to the second, a large difference in performance could be a
sign of a problem with the implementation for one of these cases. A similar
scenario for strings could compare 4x string columns with 0 to 16 characters
each to a single string column with 0 to 64 characters.
Combine with 1.
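The "easy way to transform" mentioned above can be sketched in Python (names are illustrative, not Arrow APIs): four int64 key columns pack losslessly into one 32-byte binary key, so joins on either representation are equivalent.

```python
import struct

def pack_keys(a: int, b: int, c: int, d: int) -> bytes:
    """Concatenate four int64 key values into a single 32-byte binary key.

    Distinct 4-tuples map to distinct 32-byte values, so joining on the
    packed column is equivalent to joining on the four original columns.
    """
    return struct.pack("<4q", a, b, c, d)

row = (1, -2, 3, 4)
key = pack_keys(*row)
assert len(key) == 32                    # fixed-length 32B binary key
assert struct.unpack("<4q", key) == row  # lossless round trip
```

Because the two forms carry identical information, a large performance gap between them would point at the implementation rather than at the data.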
h2. 3. Selectivity
One of the functions of the hash join (for semi or inner join) is to remove
rows on the probe side with no matches on the build side. Hash join may choose
for instance to build a Bloom filter for keys on the build side to quickly
eliminate such rows on the probe side, providing a fast membership test but
with a small probability of false positives. It may also be interesting to
compare the filtering performance of semi hash join to that of in-list filter,
since both implement roughly the same functionality.
Filtering performance of hash join mostly depends on two factors:
a) number of distinct keys on build side (or number of inserted rows on build
side for some implementations, which may be different in case of duplicate
keys);
b) probability of accepting a key on probe side (selectivity).
The first factor affects the cost of a single filter query, while the second
affects the relative portion of processing time spent doing post-filter tasks,
like hash table lookups and materialization of matching rows from build side.
The join filtering performance test should include a comparison of semi-join to
anti-semi-join. A side-by-side view of the results obtained for both of these
join types at the same selectivity is interesting. On one hand, it seems that
anti-join could be implemented exactly like semi-join, just with the filter
results negated. On the other hand, there are some important differences
between the two cases, e.g. semi-join can use a Bloom filter (with false
positives) for fast elimination of rows, while anti-join cannot (a negated
Bloom filter result would introduce false negatives).
Combine with 1. and 2.
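A minimal Python sketch (illustrative, not Arrow's implementation) of why the Bloom filter helps semi-join but not anti-join: a negative answer is definitive, so semi-join can drop the row early, while a positive answer may be a false positive, so anti-join cannot drop on it.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive independent hash functions by salting a single hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(key, salt=bytes([seed]) * 16).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

build_keys = [f"key{i}".encode() for i in range(1000)]
bf = BloomFilter()
for k in build_keys:
    bf.add(k)

# Semi-join: "definitely absent" is exact, so those probe rows can be dropped
# before the (more expensive) hash table lookup.
assert all(bf.may_contain(k) for k in build_keys)  # no false negatives
# Anti-join: "maybe present" can be a false positive, so negating the filter
# would wrongly drop rows that actually have no match (false negatives).
```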
h2. 4. Join types
There are 8 basic types of join: left/right semi, left/right anti, inner,
left/right outer and full outer. The processing pipeline for each of them is
slightly different, with a different subset of required steps. Join filtering
tests will cover the semi and anti variants, and basic tests will be based on
inner join, which leaves outer joins as the missing area for performance
evaluation. Full outer join combines the unique steps of left and right outer
joins, so it should be enough to test left and right outer joins separately to
get a good picture of the performance of hash join steps unique to outer joins.
Combine with 1. and 3.
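A toy Python sketch (illustrative only, and assuming probe side = left) of the extra steps the outer variants add on top of inner join: left outer emits unmatched probe rows immediately, while right/full outer require the final hash table scan over match flags mentioned above.

```python
def hash_join(build, probe, join_type="inner"):
    """Toy hash join over (key, payload) rows; probe side is the left side."""
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)
    matched = dict.fromkeys(table, False)      # per-key match flags

    out = []
    for key, payload in probe:
        if key in table:
            matched[key] = True
            for build_payload in table[key]:
                out.append((key, build_payload, payload))
        elif join_type in ("left", "full"):    # preserve unmatched probe rows
            out.append((key, None, payload))
    if join_type in ("right", "full"):         # scan for unmatched build rows
        for key, was_matched in matched.items():
            if not was_matched:
                for build_payload in table[key]:
                    out.append((key, build_payload, None))
    return out

build = [(1, "b1"), (2, "b2")]
probe = [(2, "p2"), (3, "p3")]
assert hash_join(build, probe) == [(2, "b2", "p2")]
assert hash_join(build, probe, "full") == [
    (2, "b2", "p2"), (3, None, "p3"), (1, "b1", None)]
```

Since "full" exercises both the left-outer and the right-outer code paths, benchmarking left and right outer separately covers all the outer-join-specific steps.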
h2. 5. Number of matches
The work required to process many-to-many joins may present unique challenges
that are not present in the case of many-to-one joins. For instance, the
possibility of multiple matches for an input row on the probe side means that
we may need to produce multiple copies of that row - one for each match in the
hash table. We may also need to traverse chains of rows stored in the hash
table under a single key in order to retrieve build-side payloads for all
matching rows.
Of particular interest is looking at the average number of matches in the
context of inequality joins. Hash join can perform a join operation with a
predicate that mixes key equalities with key inequalities. The simplest
implementation uses a hash table for equalities and then traverses potentially
many matches coming from it, evaluating inequalities for each of them. The
usefulness of such an implementation depends on the cost of traversal of chains
of matches for a given key and materialization of their data for the purpose of
residual predicate evaluation. It may be acceptable for high selectivity
residual predicates (high percentage of candidate matching rows pass the
predicate) but not for low selectivity ones (most of the matching rows are
false positives rejected by subsequent inequality tests).
Combine with 1.
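The "simplest implementation" of a mixed equality/inequality predicate described above can be sketched in Python (illustrative, not Arrow code): the hash table handles the equality, and every candidate in the per-key chain is materialized and checked against the residual inequality.

```python
def mixed_predicate_join(build, probe):
    """Equality on key via hash table, plus a residual inequality
    (build_value < probe_value) evaluated on every candidate match."""
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)  # chain of duplicates per key
    out = []
    for key, value in probe:
        for build_value in table.get(key, ()):   # traverse the whole chain
            if build_value < value:              # residual predicate
                out.append((key, build_value, value))
    return out

build = [(1, 10), (1, 20), (1, 30)]
probe = [(1, 25)]
# Three candidates share key 1; only two pass the inequality, so one
# traversal + materialization was wasted on a false positive.
assert mixed_predicate_join(build, probe) == [(1, 10, 25), (1, 20, 25)]
```

The cost is dominated by chain traversal and materialization, so this approach degrades as the fraction of candidates rejected by the inequality grows - exactly the trade-off described above.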
h2. 6. Payload size
Size of payload columns affects the cost of data movement when partitioning,
copying or fetching rows from a hash table. It seems reasonable to look at
varying payload sizes in the range of perhaps 0 to 100 bytes.
Combine with 1. and 5.
h2. 7. Degree of parallelism
The probe side should scale very well with an increasing number of executing
threads, since (almost) all data structures, including the hash table, are
read-only during probe-side processing and do not require any synchronization
between the threads. Also, all threads share the same read-only copy of the
data, which means that they use the CPU cache efficiently during probe-side
processing (without competing needs across the threads executing the hash join
together).
The build side is of particular interest with respect to the degree of
parallelism, since the hash table build process is not trivially
parallelizable.
Combine with: 1. and 2.
h2. 8. Dictionaries
The Arrow data format supports dictionary-encoded values (there is an array of
values called a dictionary and another array storing only integer ids pointing
to values in the dictionary). There is an opportunity for hash join to process
data faster in the presence of dictionary-encoded key columns, since a
dictionary already accomplishes some of the work that hash join needs to do
when mapping duplicate keys to the same hash table entry. The most meaningful
scenario for hash join involves normalized dictionaries - dictionaries
containing only unique values. It is interesting to compare hash join
performance on the same key column with and without a normalized dictionary.
There are 4 interesting cases: no dictionaries, a dictionary on the build side
but not on the probe side, a dictionary on the probe side but not on the build
side, and (different) dictionaries on both sides.
Combine with 1. and 2.
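To illustrate the opportunity (a Python sketch with illustrative names, not Arrow's dictionary API): with normalized dictionaries on both sides, the join only needs to translate one dictionary's codes into the other's - one lookup per distinct value instead of one hash table operation per row.

```python
def normalize(values):
    """Dictionary-encode a column: unique-value dictionary + integer codes."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

build_dict, build_codes = normalize(["a", "b", "a", "c"])
probe_dict, probe_codes = normalize(["b", "d", "b"])

# Translate probe-side codes into build-side codes once per dictionary
# entry (None = the value cannot match any build row).
build_index = {v: i for i, v in enumerate(build_dict)}
translation = [build_index.get(v) for v in probe_dict]
probe_as_build_codes = [translation[c] for c in probe_codes]
assert probe_as_build_codes == [1, None, 1]  # "b" -> build code 1, "d" -> no match
```

The four cases listed above differ in where this translation work lands: with no dictionaries every row is hashed, while with (different) dictionaries on both sides the per-row work collapses to integer code lookups.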
h2. 9. [Optional] Distribution of key frequencies
Tests involving the dimensions above should be performed on key columns with a
uniform distribution for the most meaningful results. That said, real-world
data often does not follow a uniform distribution and is closer to some form of
exponential distribution. Although it is not commonly implemented, it is
possible to add to hash join performance optimizations specifically targeting
the case where keys have very different frequencies on the probe side. Keeping
the most frequent keys and their payloads together in memory should change the
cache hit ratios for hash tables larger than the CPU cache.
Combine with 1.
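A small Python sketch of why such an optimization could pay off (the Zipf-like generator and its parameters are illustrative assumptions, not a prescribed test workload): under a skewed distribution a small set of hot keys covers a large share of probe rows, so keeping their hash table entries resident in cache affects most lookups.

```python
import random
from collections import Counter

def zipf_keys(num_rows, num_keys, s=1.2, seed=0):
    """Sample keys with a Zipf-like skew: low key ids are far more frequent."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(num_keys)]
    return rng.choices(range(num_keys), weights=weights, k=num_rows)

keys = zipf_keys(100_000, 10_000)
top = Counter(keys).most_common(100)
hot_fraction = sum(count for _, count in top) / len(keys)
# Only 1% of the distinct keys, yet they account for the majority of probes.
print(f"top 100 of 10,000 keys cover {hot_fraction:.0%} of probe rows")
```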
h1. Separating build side and probe side costs
In the course of hash join processing, the build side and the probe side of the
join are typically processed in separate phases: the join starts with the hash
table build operation, which involves only the build side, then switches to
hash table lookups for all rows on the probe side, and finally, optionally,
scans the hash table (outputting hash table rows with either any matches or no
matches). Most dimensions of hash join performance testing mentioned in this
document affect both sides of the join, but the challenges for the build side
and the probe side can be quite different. When testing, it is easy to vary the
relative size (in number of rows) of the build side vs the probe side.
Generating 10x more rows on the probe side will make almost all of the
build-side processing cost overshadowed by the probe-side processing cost.
Symmetrically, testing with 10x fewer rows on the probe side will mostly show
only the build-side processing costs. There is no need to vary the relative
sizes of the two sides between these extremes, since the costs in between
should, with high probability, follow a linear combination of the costs at the
extremes.
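Since the phases run back to back, they can be timed separately in the same run; a toy Python sketch (illustrative only - a real benchmark would time the Arrow implementation) of separating the two costs:

```python
import time

def timed_join(build_rows: int, probe_rows: int):
    """Time the build and probe phases of a toy hash join separately."""
    t0 = time.perf_counter()
    table = dict.fromkeys(range(build_rows))  # build phase: insert all keys
    t1 = time.perf_counter()
    # Probe phase: every probe key maps into the build key range, so all match.
    matches = sum(1 for k in range(probe_rows) if k % build_rows in table)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, matches

# 10x more probe rows makes probe time dominate; 10x fewer makes build dominate.
build_t, probe_t, found = timed_join(100_000, 1_000_000)
assert found == 1_000_000
print(f"build: {build_t:.3f}s, probe: {probe_t:.3f}s")
```

Measuring the two extremes then predicts intermediate ratios by the linear combination argued above, without benchmarking every ratio.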
> [C++][Compute] Hash Join microbenchmarks
> ----------------------------------------
>
> Key: ARROW-14479
> URL: https://issues.apache.org/jira/browse/ARROW-14479
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 7.0.0
> Reporter: Michal Nowakiewicz
> Assignee: Sasha Krassovsky
> Priority: Major
> Fix For: 7.0.0
>
>
> Implement a series of microbenchmarks giving a good picture of the
> performance of hash join implemented in Arrow across different set of
> dimensions.
> Compare the performance against some other product(s).
> Add scripts for generating useful visual reports giving a good picture of the
> costs of hash join.
> Examples of dimensions to explore in microbenchmarks:
> * number of duplicate keys on build side
> * relative size of build side to probe side
> * selectivity of the join
> * number of key columns
> * number of payload columns
> * filtering performance for semi- and anti- joins
> * dense integer key vs sparse integer key vs string key
> * build size
> * scaling of build, filtering, probe
> * inner vs left outer, inner vs right outer
> * left semi vs right semi, left anti vs right anti, left outer vs right outer
> * non-uniform key distribution
> * monotonic key values in input, partitioned key values in input (with and
> without per batch min-max metadata)
> * chain of multiple hash joins
> * overhead of Bloom filter for non-selective Bloom filter
--
This message was sent by Atlassian Jira
(v8.20.1#820001)