One other thing I can say that may be helpful is that when you do the "max time
of N samples" you shift the probability distribution (not density) by an
_exponent_ of N. See
[https://en.wikipedia.org/wiki/Extreme_value_theory](https://en.wikipedia.org/wiki/Extreme_value_theory)
for details. Basically, if the latency has distribution F(t), the worst of 5
has distribution F(t)**5. The median is the time where the distribution equals
0.5. So, the median of the worst of 5 satisfies F(t)**5 = 0.5, i.e. F(t) =
0.5**(1/5), which is about the 87th percentile of the base time distribution.
(Equivalently, the worst of 5 exceeds the base median 1 - 0.5**5 = 96.875% of
the time.) Similarly, the median of the min time of 5 runs would be about the
13th percentile of the base time distribution. So with best/worst of 5 you
expect (in a median sense) to sample the lower/upper tail of the base
distribution. A lot of the time in benchmarking one takes the min of N to
filter out all the noise from other things going on in the system. Unlike many
things in statistics, these results do not rely on anything about the shape of
F(t). It's just basic probability theory.
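As a quick sanity check of that order-statistics math, here is a simulation sketch of my own (the exponential base distribution is an arbitrary stand-in; any F works):

```python
import random
import statistics

# Claim: the median of the max of N draws from F sits at the 0.5**(1/N)
# quantile of F, regardless of F's shape. Check empirically for N = 5.
random.seed(42)
N, TRIALS = 5, 200_000

# Arbitrary base distribution: exponential "latencies" with mean 1.
base = sorted(random.expovariate(1.0) for _ in range(TRIALS))
maxes = [max(random.expovariate(1.0) for _ in range(N)) for _ in range(TRIALS)]

med_of_max = statistics.median(maxes)
predicted = base[int(0.5 ** (1 / N) * TRIALS)]  # ~87th percentile of base
print(f"median of max-of-{N}: {med_of_max:.3f}, predicted: {predicted:.3f}")
```

The two numbers agree to within sampling noise, and swapping in any other base distribution gives the same agreement.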
What this means in this benchmark example is that it is going to be very
sensitive (by construction if not intent) to **a lot** of competing activity on
the system: network interrupts, competing processes, OS kernel goings-on,
L2/L3 cache evictions, etc. So, it is "noisy by design". The bigger N is, the
noisier the max will be. For example, if you have a Winchester hard drive on
the system and you make N big enough you will sample some suspension of the
process, CPU migration, total re-warm up of L2 cache, and however long the disk
service takes. None of that really relates _directly_ to "memory manager
performance". It's more measuring how loaded your system is.
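To make the "noisy by design" point concrete, here is a toy sketch (my own illustration; the 1% stall rate and the magnitudes are invented): a latency that is usually ~1 unit but occasionally hits a 100-unit interference stall. Min-of-N filters the stalls out, while max-of-N hits one with probability 1 - 0.99**N, which grows with N:

```python
import random
import statistics

random.seed(1)

def latency():
    # Usually ~1 time unit; 1% of calls hit a 100-unit "interference" stall.
    # (Both numbers are made up purely for illustration.)
    return 100.0 if random.random() < 0.01 else random.uniform(0.9, 1.1)

stall_frac = {}
for n in (1, 5, 50):
    maxs = [max(latency() for _ in range(n)) for _ in range(2000)]
    stall_frac[n] = sum(m > 50 for m in maxs) / len(maxs)
    # Expected P(stall in max-of-n) = 1 - 0.99**n: ~1%, ~5%, ~39%
    print(f"max-of-{n}: fraction hitting a stall = {stall_frac[n]:.2f}")

mins = [min(latency() for _ in range(5)) for _ in range(2000)]
print("median min-of-5:", round(statistics.median(mins), 2))  # stalls gone
```

So the max statistic measures the interference, not the code under test, and it gets worse the more samples you take.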
Anyway, to me this explains the high run-to-run variability/difficulty of
interpretation/difficulty of reproduction here. The median of the max already
sits high up the base distribution's tail, and the tail of F(t)**5 is even
worse than its median. And
those tail-tail events are ever less relevant to comparing memory managers. To
the extent they are of interest, because they are so very noisy, it's going to
require a lot of sampling to assess them.
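To put a back-of-envelope number on how much worse the tail of F(t)**5 is (assuming only the F**N relation above): the p-quantile of the max-of-N is the p**(1/N) quantile of the base distribution, so even modest quantiles of the max map deep into the base tail:

```python
# p-quantile of max-of-N = p**(1/N) quantile of the base distribution F.
N = 5
for p in (0.5, 0.9, 0.99):
    print(f"max-of-{N} p={p} -> base quantile {p ** (1 / N):.4f}")
# 0.5 -> ~0.87, 0.9 -> ~0.98, 0.99 -> ~0.998
```

The 99th percentile of the worst-of-5 is roughly the 99.8th percentile of single runs: a 1-in-500 event, which is why resolving it takes a lot of samples.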
The truly worst-worst-worst cases where everything is cold cache are actually
pretty hard to measure for a variety of reasons. A lot of the time "hot loop
benchmarks" can be **_very_** misleading about worst case latency and even pros
mess this up _all the time_. I could provide a couple of fun examples sometime.
Memory is mostly about DIMM/L3/L2/L1 latencies and bandwidths, though. You can
measure those and reason from there more by looking at code and "calculating"
from those measured constants than by timing it. Often cold cache DIMM hits are
the most important thing if you aren't moving a lot of memory around. Then
number of copies if you are moving. { This is why I like linear probing hash
tables (with or without local Robin Hood reorganization, which can help
insert-heavy workloads). You know a lookup is basically just one 64-128 byte
cache-line fill for really big tables when it's most painful, and you can't do
better than 1. }
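For the curious, here is a minimal linear-probing sketch of my own (a toy, not any particular library's implementation; no resizing or deletion, so the caller must keep the load factor below 1). The point is that a lookup touches one hash-derived slot and then only adjacent slots, so for a huge table the expected cost is about one cache-line fill:

```python
class LPTable:
    """Toy open-addressed hash table with linear probing."""

    def __init__(self, cap=8):
        self.slots = [None] * cap  # each slot: (key, value) or None

    def _probe(self, key):
        # First probe is the one expected cache miss on a huge table; later
        # probes walk adjacent slots, likely in the same 64-128 byte line.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # step to the neighboring slot
        return i

    def put(self, key, val):
        # Assumes the table never fills completely (no resize in this toy).
        self.slots[self._probe(key)] = (key, val)

    def get(self, key, default=None):
        s = self.slots[self._probe(key)]
        return s[1] if s else default

t = LPTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"), t.get("b"), t.get("c"))  # -> 1 2 None
```

Contrast this with chained tables or trees, where a cold lookup can take several dependent cache misses before it even finds the key.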