Oh, and that 8.2x was on the very same machine as my result tables above, and 
it is quite a bit larger than _any_ margin I saw on your tests with a PGO build 
and fast hash. To me that makes your tests almost totally inconclusive, at least 
in terms of "truly random" access.

I agree it's a tricky situation. Even workload record & playback can have 
trouble. Cycle counters like `rdtsc` are another approach, but those also need 
special care. You can put serializing instructions near them (`rdtsc` _used_ to 
be serializing itself, and perhaps stopped being so to fake better performance). 
But then the measurement itself can cost more than the "L1 speed" operations 
being timed, giving only a kind of upper bound on the time. For long-running, 
out-of-cache ops that can work, though.
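
To make the `rdtsc` point concrete, here is a minimal sketch, assuming an x86 machine and GCC or Clang (for `<x86intrin.h>`). The names `fenced_rdtsc` and `tsc_overhead` are made up for illustration. It fences each counter read with `lfence` and then measures what an empty start/stop pair costs. Whatever number comes back is a floor under every measurement, which is why this style only gives an upper bound for operations that are themselves only a few cycles long.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: read the time stamp counter with an lfence on each
   side, so earlier instructions finish before the read and later ones do
   not start until the read is done. */
static inline uint64_t fenced_rdtsc(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical helper: smallest observed cost, in TSC ticks, of a
   back-to-back pair of fenced reads with nothing in between. */
uint64_t tsc_overhead(int reps) {
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = fenced_rdtsc();
        uint64_t t1 = fenced_rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

If `tsc_overhead(1000)` comes back at tens of ticks, then timing a single L1-speed load this way mostly measures the fences. Timing a long run of out-of-cache accesses and dividing by the count makes the overhead negligible.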

Anyway, you aren't the first and won't be the last person I mention this 
problem to. It's neither obvious nor discussed as widely as it should be, and I 
don't mean any kind of personal insult. Many people reject "cache miss 
counting" analyses in favor of direct timing for other reasons (e.g., suspicion 
of theory habits). There are pros & cons. Counting loads/stores is sometimes 
easy and is less fooled by CPUs, but also only applies at large scale. I don't 
claim to have all the answers.
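
For what a counting-style analysis might look like, here is a rough sketch. The function name `count_line_changes` and the model behind it are my assumptions for illustration, not anything measured: it treats every access that lands on a different cache line than the previous access as one miss. That is only a sensible stand-in at large scale, when the working set is far bigger than cache and accesses are close to random.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: count accesses whose cache line differs from the
   line of the previous access. The first access always counts. */
size_t count_line_changes(const uint64_t *addrs, size_t n, unsigned line_bytes) {
    assert(line_bytes > 0);
    size_t changes = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = addrs[i] / line_bytes;  /* index of the cache line */
        if (i == 0 || line != prev) {
            changes++;
            prev = line;
        }
    }
    return changes;
}
```

Feeding it the byte addresses a lookup touches (with 64 as an assumed line size) gives a count that does not depend on clock speed, prefetchers, or timer overhead, at the price of ignoring everything that makes small-scale behavior differ from the model.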
