Oh, and that 8.2x was on the very same machine as my result tables above, and it is quite a bit larger than _any_ margin I saw in your tests with a PGO build and fast hash. To me that makes your tests almost totally inconclusive, at least for "truly random" access.
I agree it's a tricky situation. Even workload record & playback can have trouble. Cycle counters like `rdtsc` are another approach, but those also need special care. You can put serializing instructions near them (`rdtsc` _used_ to be serializing itself; perhaps that was dropped to fake better performance). But then the serialization can cost more than the "L1 speed" operations being measured, so you only get a kind of upper bound on the time. For long-running, out-of-cache ops that can work, though. Anyway, you aren't the first and won't be the last person I mention this problem to. It's neither obvious nor discussed as widely as it should be, and I don't mean any kind of personal insult. Many people reject "cache miss counting" analyses in favor of direct timing for other reasons (e.g., a habitual suspicion of theory). There are pros & cons. Counting loads/stores is sometimes easy and is less fooled by CPUs, but it also only applies at large scale. I don't claim to have all the answers.
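
To make the "counting" side a bit more concrete, here is a rough sketch of the kind of back-of-envelope model I have in mind. It is not taken from any of the benchmarks in this thread: the function names, the 64-byte line size, and the assumption of a line-aligned table with packed slots and linear probing are all mine, purely for illustration.

```python
# Hypothetical model: how many distinct cache lines does one linear-probe
# lookup touch? All names and parameters here are illustrative assumptions.

CACHE_LINE = 64  # bytes; an assumed line size, common on x86 but not universal


def lines_touched(start_slot, probes, slot_bytes, line_bytes=CACHE_LINE):
    """Distinct cache lines covered by `probes` consecutive slots.

    Assumes the table starts on a cache-line boundary and slots are packed.
    """
    if probes <= 0:
        return 0
    first_byte = start_slot * slot_bytes
    last_byte = (start_slot + probes) * slot_bytes - 1
    return last_byte // line_bytes - first_byte // line_bytes + 1


def average_lines(num_slots, probes, slot_bytes, line_bytes=CACHE_LINE):
    """Average lines touched over every start slot (no wrap-around modeled)."""
    starts = range(num_slots - probes + 1)
    total = sum(lines_touched(s, probes, slot_bytes, line_bytes) for s in starts)
    return total / len(starts)


if __name__ == "__main__":
    # 8-byte slots, 2 probes per lookup: most lookups stay within one line.
    print(average_lines(1 << 10, 2, 8))
```

If the table is far larger than cache and the access pattern really is random, each touched line is roughly one miss, which is where a count like this starts to say something about time. For small, cache-resident tables it says very little, which is the "only applies at large scale" caveat.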