(key concepts are triple-star bold-italic for your reading ease)
@inv2004 \- fnv1 does byte-at-a-time processing, which is slower than
8-byte-at-a-time processing on modern CPUs simply because there are many fewer
loop iterations. So, **_for example_** , here is a table I made { on just one
CPU (i7-6700k) } for **_string hashing with C implementations_** :
```
mem-fold : 0.03162 +- 0.00012 *$2 + 6.036 +- 0.012 *$3
mem-farm : 0.16570 +- 0.00014 *$2 + -1.8574 +- 0.0039 *$3
mem-cb4 : 0.17094 +- 0.00017 *$2 + 4.5186 +- 0.0095 *$3
mem-cb2 : 0.17737 +- 0.00021 *$2 + 4.946 +- 0.012 *$3
mem-WY : 0.20418 +- 0.00036 *$2 + 22.792 +- 0.025 *$3
mem-wy : 0.20431 +- 0.00059 *$2 + 11.031 +- 0.028 *$3
mem-xx : 0.23417 +- 0.00025 *$2 + 36.827 +- 0.041 *$3
mem-nim2 : 0.4461 +- 0.0012 *$2 + 13.848 +- 0.092 *$3
mem-rj : 0.83094 +- 0.00025 *$2 + 17.465 +- 0.045 *$3
mem-mmr : 0.86504 +- 0.00034 *$2 + -0.985 +- 0.019 *$3
mem-ph : 0.89815 +- 0.00035 *$2 + 11.662 +- 0.016 *$3
mem-sip : 0.97731 +- 0.00017 *$2 + 23.887 +- 0.019 *$3
mem-bz1 : 1.38698 +- 0.00029 *$2 + 0.327 +- 0.039 *$3
mem-cb3 : 1.39364 +- 0.00037 *$2 + 0.663 +- 0.042 *$3
mem-cb1 : 1.40357 +- 0.00097 *$2 + 4.155 +- 0.052 *$3
mem-gnu : 2.08673 +- 0.00042 *$2 + 4.112 +- 0.025 *$3
mem-glib : 2.08804 +- 0.00031 *$2 + 3.368 +- 0.038 *$3
mem-djb2 : 2.08806 +- 0.00019 *$2 + 3.352 +- 0.030 *$3
mem-fnv0a : 2.78310 +- 0.00017 *$2 + -0.137 +- 0.023 *$3
mem-fnv0 : 2.78315 +- 0.00021 *$2 + -0.193 +- 0.042 *$3
mem-fnv1a : 2.78325 +- 0.00024 *$2 + 1.239 +- 0.027 *$3
mem-fnv1 : 2.78330 +- 0.00023 *$2 + 1.258 +- 0.023 *$3
mem-sdbm : 2.7855 +- 0.0017 *$2 + 2.029 +- 0.097 *$3
mem-elf : 3.47865 +- 0.00022 *$2 + 2.701 +- 0.019 *$3
mem-nim : 3.4805 +- 0.0021 *$2 + 3.341 +- 0.068 *$3
mem-pjw : 4.28175 +- 0.00042 *$2 + 3.945 +- 0.039 *$3
mem-ELF : 4.95375 +- 0.00022 *$2 + 3.268 +- 0.037 *$3
mem-crc : 5.5681 +- 0.0010 *$2 + 3.52 +- 0.17 *$3
```
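To make the byte-at-a-time vs. 8-byte-at-a-time distinction concrete, here is a minimal C sketch of both loop shapes - plain FNV-1a next to an xor-fold in the spirit of `fold` above (my own reconstruction for illustration, not the benchmarked code):

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time FNV-1a: one xor + multiply per input byte. */
uint64_t fnv1a(const char *s, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= (uint8_t)s[i];
        h *= 0x100000001b3ULL;            /* FNV 64-bit prime */
    }
    return h;
}

/* 8-bytes-at-a-time xor-fold (my guess at the shape of "mem-fold"):
 * roughly 1/8th the loop iterations, but a very weak distribution. */
uint64_t fold8(const char *s, size_t n) {
    uint64_t h = 0, w;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        memcpy(&w, s + i, 8);             /* unaligned-safe 8-byte load */
        h ^= w;
    }
    w = 0;                                /* tail: zero-padded partial word */
    memcpy(&w, s + i, n - i);
    return h ^ w;
}
```

Note the trade-off: the fold variant touches 8 bytes per iteration, but swapping two aligned 8-byte blocks of the input leaves its hash unchanged - the distribution problem discussed further down.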
The numbers in the table are **_"full-speed CPU clock cycles per byte" and
"per hash"_** and the **_uncertainties_** in those numbers come from a
**_straight-line regression_** wherein the **_minimum time_** over some set of
runs is used as the "y" and the **_number of bytes_** as the "x". So, the slope
of this line is the first number, the time per byte, and its intercept is the
second, the time per hash. The noise is kept low by taking the minimum. Unless
you show the full plot (with a swarm of trials) of time vs. nBytes,
uncertainties are a good way to summarize the spread/range of values, and
people do love their simplified summaries. :-) { and on that note, I will
refrain from defining what all those hashes are. Many are common enough to
know, though. }
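For concreteness, the regression itself is just ordinary least squares over the (nBytes, min-time) points - a minimal C sketch (fitl does far more, e.g. the uncertainty estimates; the names here are mine):

```c
#include <stddef.h>

/* Ordinary least squares fit y = slope*x + icept over (nBytes, minTime)
 * points: slope ~ cycles/byte, icept ~ cycles/hash, as described above. */
void fit_line(const double *x, const double *y, size_t n,
              double *slope, double *icept) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *icept = (sy - *slope * sx) / n;
}
```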
This fit-to-a-min(many) idea is a good way to **_hot-cache-measure_** string
hashing or many other scale-dependent problems, since you not only get
uncertainty/measurement-error estimates (from the shape of the parabola at the
least-squares minimum, or from bootstrap replications) but also minimize them.
You can actually "just use" my [fitl](https://github.com/c-blake/fitl) program
to do that regression if you want, and even get a bootstrap covariance matrix.
You can use @Vindaar's [Measuremancer](https://scinim.github.io/Measuremancer/)
to do **_uncertain-value gymnastics_** (like taking differences, ratios,
reciprocals). NOTE: you want to work with "times", not "rates", because times
are additive (what `R` or statisticians call an "additive model", more
generally). So, only do 1/time-per-byte to get the oft-used bytes/time way
people quote hash function performance at the very end (or not at all!).
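As a worked example of that final conversion, here is a C sketch of standard first-order error propagation for r = 1/t (dr = dt/t^2) - illustrative only; Measuremancer does this kind of arithmetic for you:

```c
/* Convert a time-per-byte t with uncertainty dt into the oft-quoted
 * bytes/time rate r = 1/t, propagating the error: dr = dt / t^2.
 * Per the advice above, do this conversion only at the very end. */
void time_to_rate(double t, double dt, double *r, double *dr) {
    *r  = 1.0 / t;
    *dr = dt / (t * t);
}
```

E.g., mem-farm's 0.16570 +- 0.00014 cycles/byte becomes about 6.035 +- 0.005 bytes/cycle.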
In addition to speed, one must always assess **_hash distribution_**, as
@PMunch did very preliminarily. E.g., that fastest `fold` hash is just each
8-byte block xor'ed on top of itself and has a terrible distribution against
most inputs. FWIW, I think gcc has gotten better at optimizing Google's Farm
Hash impls over the years - like 2..3X better time/byte than many years ago.
**_Farm hash_** is the winner in the above table, all things considered, but
again - only one CPU, and these are C, not Nim, impls, which makes conclusions
a bit suspect given my observed sensitivity to assembly generation. Someone
should really re-do this in pure Nim to truly assess.
I actually have a **_Nim impl of WangYi's hash_** at the top of
<https://github.com/c-blake/cligen/blob/master/cligen/strUt.nim> , but it still
needs some love to grow a giant `case` statement to handle short strings well
(as his C version has always done).
* * *
Also, I have not studied that benchmark in detail, but speaking more broadly -
reading its rules, I suspect a table (`adix/lptabz` or `std/tables`) with a
`cligen/mslice.MSlice` key and a value that is a 128-bit (16-byte) integer
would probably be a better idea, as this **_removes all allocation_** from
building it. You can **_easily get a quick_** impression of how much faster
that is by altering the rules slightly (only 64 tags, not 100) and using
`[MSlice,uint64]`. (I haven't measured it, but presumably very few if any posts
have 65..100 tags.)
Anyway, with no hash caching via `adix/lptabz`, that would keep each hash cell
to 1/2 of a 64B cache line - 16 bytes for the MSlice and 16 for the inline
bit-vector. So, you would need a collision depth > 2 (made rare with Robin
Hood hashing) to have more than 1 cache miss per access, and you could
probably also do fast bit-wise operations on the tag sets (pop-count, bitwise
and/or, etc.). In fact,
<https://github.com/c-blake/bu/blob/main/wsample.nim> **_solves almost exactly
this problem_** in its `loadWeights` proc (but just with `std/tables`).
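A sketch of those fast bit-wise tag-set operations, assuming tags are numbered 0..63 into a `uint64` (my own numbering scheme, not from `wsample.nim`; `__builtin_popcountll` is a gcc/clang builtin):

```c
#include <stdint.h>

/* With <= 64 tags, a post's tag set fits one uint64 bit-vector, so
 * membership, union, intersection & counting are 1-instruction ops. */
static inline uint64_t tag_add(uint64_t set, int tag)   { return set | (1ULL << tag); }
static inline int      tag_has(uint64_t set, int tag)   { return (set >> tag) & 1; }
static inline uint64_t tag_both(uint64_t a, uint64_t b) { return a & b; } /* intersection */
static inline uint64_t tag_any (uint64_t a, uint64_t b) { return a | b; } /* union */
static inline int      tag_count(uint64_t set)          { return __builtin_popcountll(set); }
```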
On a benchmark methodology note, it's also worth noting that the scale of this
problem is very small - probably fully L2-CPU-cache resident. This means two
things. First, it makes the benchmark very sensitive to whether a given
prog.lang. (PL) impl can achieve great memory density. That makes it almost
**_maximally misleading_** (which in turn makes it **_maximum clickbait_** ).
Second, it means times in the range of 10s of milliseconds, which are
**_probably quite unstable_** / polluted by OS/competition noise and best
measured by something like
[tim](https://github.com/c-blake/bu/blob/main/doc/tim.md), not whatever that
benchmark author did, which does not seem to even assess time reproducibility.
* * *
As far as Nim goes, personally, I think the biggest issue with the current
`hash(string)` is that its output is **_only 32 bits_** for compatibility with
JavaScript backends not using `JsBigInt`. Also, as Araq mentions, someone (TC,
I think) convinced him (which I believe he now regrets) to **_have `Table` be
`const`-able_**, which can create very confusing situations if you use a
different run-time hash from the compile-time one, but OTOH can be nice if you
do _not_ change the run-time hash.