On Friday, 30 January 2015 at 14:41:11 UTC, Laeeth Isharc wrote:
Thanks, Adam. That's what I had thought (your first paragraph), but something Ola said on a different thread confused me and made me think I didn't understand it, so I wanted to pin it down.

There are always significant optimization effects in long-running loops:
- SIMD
- cache locality / prefetching

For the former (SIMD) you need to make sure good code is generated, either by hand, by using vectorized libraries, or via auto-vectorization.

For the latter (cache) you need to make sure the prefetcher can predict your access pattern (or is told to prefetch explicitly), and that the working set is small enough to stay in the faster cache levels.

If you want good performance you cannot ignore any of these, and you have to design your data structures and algorithms for them. Prefetching has to happen maybe 100 instructions before the actual load from memory, and AVX wants 32-byte alignment and a data layout that fits the algorithm. On the next-gen Xeon Skylake I think the alignment requirement might go up to 64 bytes, and you get 512-bit-wide registers (so you can do eight 64-bit floating-point operations in parallel per register, per core). The difference between issuing 1-4 ops and issuing 8-16 per time unit is noticeable...

And of course, the closer your code gets to the CPU's theoretical throughput, the more critical it becomes not to wait for memory loads.

This is also a moving target...
