(key concepts are triple-star bold-italic for your reading ease)
@inv2004 \- fnv1 does byte-at-a-time processing, which is slower than
8-byte-at-a-time processing on modern CPUs simply because there are many fewer
loop iterations. So, **_for example_** , here is a table I made { on just one
CPU (i7-6700k) } for **_string hashing with C implementations_** :
```
mem-fold : 0.03162 +- 0.00012 *$2 + 6.036 +- 0.012 *$3
mem-farm : 0.16570 +- 0.00014 *$2 + -1.8574 +- 0.0039 *$3
mem-cb4 : 0.17094 +- 0.00017 *$2 + 4.5186 +- 0.0095 *$3
mem-cb2 : 0.17737 +- 0.00021 *$2 + 4.946 +- 0.012 *$3
mem-WY : 0.20418 +- 0.00036 *$2 + 22.792 +- 0.025 *$3
mem-wy : 0.20431 +- 0.00059 *$2 + 11.031 +- 0.028 *$3
mem-xx : 0.23417 +- 0.00025 *$2 + 36.827 +- 0.041 *$3
mem-nim2 : 0.4461 +- 0.0012 *$2 + 13.848 +- 0.092 *$3
mem-rj : 0.83094 +- 0.00025 *$2 + 17.465 +- 0.045 *$3
mem-mmr : 0.86504 +- 0.00034 *$2 + -0.985 +- 0.019 *$3
mem-ph : 0.89815 +- 0.00035 *$2 + 11.662 +- 0.016 *$3
mem-sip : 0.97731 +- 0.00017 *$2 + 23.887 +- 0.019 *$3
mem-bz1 : 1.38698 +- 0.00029 *$2 + 0.327 +- 0.039 *$3
mem-cb3 : 1.39364 +- 0.00037 *$2 + 0.663 +- 0.042 *$3
mem-cb1 : 1.40357 +- 0.00097 *$2 + 4.155 +- 0.052 *$3
mem-gnu : 2.08673 +- 0.00042 *$2 + 4.112 +- 0.025 *$3
mem-glib : 2.08804 +- 0.00031 *$2 + 3.368 +- 0.038 *$3
mem-djb2 : 2.08806 +- 0.00019 *$2 + 3.352 +- 0.030 *$3
mem-fnv0a : 2.78310 +- 0.00017 *$2 + -0.137 +- 0.023 *$3
mem-fnv0 : 2.78315 +- 0.00021 *$2 + -0.193 +- 0.042 *$3
mem-fnv1a : 2.78325 +- 0.00024 *$2 + 1.239 +- 0.027 *$3
mem-fnv1 : 2.78330 +- 0.00023 *$2 + 1.258 +- 0.023 *$3
mem-sdbm : 2.7855 +- 0.0017 *$2 + 2.029 +- 0.097 *$3
mem-elf : 3.47865 +- 0.00022 *$2 + 2.701 +- 0.019 *$3
mem-nim : 3.4805 +- 0.0021 *$2 + 3.341 +- 0.068 *$3
mem-pjw : 4.28175 +- 0.00042 *$2 + 3.945 +- 0.039 *$3
mem-ELF : 4.95375 +- 0.00022 *$2 + 3.268 +- 0.037 *$3
mem-crc : 5.5681 +- 0.0010 *$2 + 3.52 +- 0.17 *$3
```
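To make the byte-at-a-time vs. 8-byte-at-a-time distinction concrete, here is a minimal C sketch of both loop shapes - plain FNV-1a next to an xor-fold in the spirit of `fold` above (my own reconstruction for illustration, not the benchmarked code):

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time FNV-1a: one xor + multiply per input byte. */
uint64_t fnv1a(const char *s, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= (uint8_t)s[i];
        h *= 0x100000001b3ULL;            /* FNV 64-bit prime */
    }
    return h;
}

/* 8-bytes-at-a-time xor-fold (my guess at the shape of "mem-fold"):
 * roughly 1/8th the loop iterations, but a very weak distribution. */
uint64_t fold8(const char *s, size_t n) {
    uint64_t h = 0, w;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        memcpy(&w, s + i, 8);             /* unaligned-safe 8-byte load */
        h ^= w;
    }
    w = 0;                                /* tail: zero-padded partial word */
    memcpy(&w, s + i, n - i);
    return h ^ w;
}
```

Note the trade-off: the fold variant touches 8 bytes per iteration, but swapping two aligned 8-byte blocks of the input leaves its hash unchanged - the distribution problem discussed further down.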
The numbers in the table are **_"full-speed CPU clock cycles per byte" and
"per hash"_** and the **_uncertainties_** in those numbers come from a
**_straight-line regression_** wherein the **_minimum time_** over some set of
runs is used as the "y" and the **_number of bytes_** as the "x". So, the slope
of this line is the first number, the time per byte, and its intercept is the
second, the time per hash. The noise is kept low by taking the minimum. Unless
you show the full plot (with a swarm of trials) of time vs. nBytes,
uncertainties are a good way to summarize the spread/range of values, and
people do love their simplified summaries. :-) { and on that note, I will
refrain from defining what all those hashes are. Many are common enough to
know, though. }
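For concreteness, the regression itself is just ordinary least squares over the (nBytes, min-time) points - a minimal C sketch (fitl does far more, e.g. the uncertainty estimates; the names here are mine):

```c
#include <stddef.h>

/* Ordinary least squares fit y = slope*x + icept over (nBytes, minTime)
 * points: slope ~ cycles/byte, icept ~ cycles/hash, as described above. */
void fit_line(const double *x, const double *y, size_t n,
              double *slope, double *icept) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *icept = (sy - *slope * sx) / n;
}
```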
This fit-to-a-min(many) idea is a good way to **_hot-cache-measure_** string
hashing or many other scale-dependent problems, since you not only get
uncertainty/measurement-error estimates (from the shape of the parabola at the
least-squares minimum, or from bootstrap replications) but also minimize them.
You can actually "just use" my [fitl](https://github.com/c-blake/fitl) program
to do that regression if you want, and even get a bootstrap covariance matrix.
You can use @Vindaar's [Measuremancer](https://scinim.github.io/Measuremancer/)
to do **_uncertain-value gymnastics_** (like taking differences, ratios,
reciprocals). NOTE: you want to work with "times", not "rates", because times
are additive (what `R` or statisticians call an "additive model", more
generally). So, only do 1/time-per-byte to get the oft-used bytes/time way
people quote hash function performance at the very end (or not at all!).
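As a worked example of that final conversion, here is a C sketch of standard first-order error propagation for r = 1/t (dr = dt/t^2) - illustrative only; Measuremancer does this kind of arithmetic for you:

```c
/* Convert a time-per-byte t with uncertainty dt into the oft-quoted
 * bytes/time rate r = 1/t, propagating the error: dr = dt / t^2.
 * Per the advice above, do this conversion only at the very end. */
void time_to_rate(double t, double dt, double *r, double *dr) {
    *r  = 1.0 / t;
    *dr = dt / (t * t);
}
```

E.g., mem-farm's 0.16570 +- 0.00014 cycles/byte becomes about 6.035 +- 0.005 bytes/cycle.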
In addition to speed, one must always assess **_hash distribution_**, as
@PMunch did very preliminarily. E.g., that fastest `fold` hash is just each
8-byte block xor'ed on top of itself and has a terrible distribution against
most inputs. FWIW, I think gcc has gotten better at optimizing Google's Farm
Hash impls over the years - like 2..3X better time/byte than many years ago.
**_Farm hash_** is the winner in the above table, all things considered, but
again - only one CPU, and these are C, not Nim, impls, which makes conclusions
a bit suspect given my observed sensitivity to assembly generation. Someone
should really re-do this in pure Nim to truly assess.
I actually have a **_Nim impl of WangYi's hash_** at the top of
<https://github.com/c-blake/cligen/blob/master/cligen/strUt.nim> , but it still
needs some love to grow a giant `case` statement to handle short strings well
(as his C version has always done).
* * *
Also, I have not studied that benchmark in detail, but speaking more broadly -
reading its rules, I suspect a table (`adix/lptabz` or `std/tables`) with a
`cligen/mslice.MSlice` key and a value that is a 128-bit (16-byte) integer
would probably be a better idea, as this **_removes all allocation_** from
building it. You can **_easily get a quick_** impression of how much faster
that is by altering the rules slightly (only 64 tags, not 100) and using
`[MSlice,uint64]`. (I haven't measured it, but presumably very few if any posts
have 65..100 tags.)
Anyway, with no hash caching via `adix/lptabz`, that would keep each hash cell
to 1/2 of a 64B cache line - 16 bytes for the MSlice and 16 for the inline
bit-vector. So, you would need a collision depth > 2 (made rare with Robin
Hood hashing) to have more than 1 cache miss per access, and you could
probably also do fast bit-wise operations on the tag sets (pop-count, bitwise
and/or, etc.). In fact,
<https://github.com/c-blake/bu/blob/main/wsample.nim> **_solves almost exactly
this problem_** in its `loadWeights` proc (but just with `std/tables`).
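A sketch of those fast bit-wise tag-set operations, assuming tags are numbered 0..63 into a `uint64` (my own numbering scheme, not from `wsample.nim`; `__builtin_popcountll` is a gcc/clang builtin):

```c
#include <stdint.h>

/* With <= 64 tags, a post's tag set fits one uint64 bit-vector, so
 * membership, union, intersection & counting are 1-instruction ops. */
static inline uint64_t tag_add(uint64_t set, int tag)   { return set | (1ULL << tag); }
static inline int      tag_has(uint64_t set, int tag)   { return (set >> tag) & 1; }
static inline uint64_t tag_both(uint64_t a, uint64_t b) { return a & b; } /* intersection */
static inline uint64_t tag_any (uint64_t a, uint64_t b) { return a | b; } /* union */
static inline int      tag_count(uint64_t set)          { return __builtin_popcountll(set); }
```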
On a benchmark methodology note, it's also worth noting that the scale of this
problem is very small - probably fully L2-CPU-cache resident. This means two
things. First, it makes the benchmark very sensitive to whether a given
prog.lang. (PL) impl can achieve great memory density. That makes it almost
**_maximally misleading_** (which in turn makes it **_maximum clickbait_** ).
Second, it means times in the range of 10s of milliseconds, which are
**_probably quite unstable_** / polluted by OS/competition noise and best
measured by something like
[tim](https://github.com/c-blake/bu/blob/main/doc/tim.md), not whatever that
benchmark author did, which does not seem to even assess time reproducibility.
* * *
As far as Nim goes, personally, I think the biggest issue with the current
`hash(string)` is that its output is **_only 32 bits_** for compatibility with
JavaScript backends not using `JsBigInt`. Also, as Araq mentions, someone (TC,
I think) convinced him (which I believe he now regrets) to **_have `Table` be
`const`-able_**, which can create very confusing situations if you use a
different run-time hash from the compile-time one, but OTOH can be nice if you
do _not_ change the run-time hash.