One sad feature of modern 3-level cache CPU architectures (at least the non-NUMA 
AMD and Intel parts I have dealt with) is that memory bandwidth from a single CPU 
core to the DIMMs is substantially less (around 2.5x to 5x in my usual 
measurements) than the _aggregate_ memory/DIMM bandwidth. Even copying data 
_untranslated_ benefits from multi-threading by about 3x in my measurements, even 
though one CPU core is more than capable of issuing enough instructions to 
saturate the DIMMs.

I am unaware of the circuit cost/heat dissipation/etc. trade-offs that lead CPU 
engineers to provision so few "lanes" from each core to the DIMMs (I would love 
to see slides/papers on this if anyone knows of them). Usually "wider" data paths 
are not that much of a burden. In my view this property induces a great deal of 
otherwise unnecessary thread-level parallelism and its attendant software 
complexities (hence "sad").
