One sad feature of modern 3-level cache CPU architectures (at least the non-NUMA AMD and Intel parts I have dealt with) is that memory bandwidth from one CPU core to the DIMMs is substantially less (around 2.5..5x in my usual measurements) than the _aggregate_ memory/DIMM bandwidth. Even copying data _untranslated_ benefits from multi-threading by about 3x in my measurements, even though a single core has more than enough instruction throughput to saturate the DIMMs.
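
A memcpy micro-benchmark along these lines can show the effect (a minimal sketch assuming an OpenMP-capable compiler on a non-NUMA box; the 1 GiB buffer, power-of-two thread counts, and counting a copy as 2 bytes of traffic per byte copied are my illustrative choices here, not necessarily the exact methodology behind the numbers above):

```c
/* Minimal sketch: memcpy bandwidth with 1..max threads.
 * Build: cc -O2 -fopenmp bw.c -o bw
 * Illustrative only; careful runs must also consider NUMA placement,
 * huge pages, and first-touch page allocation. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES (1UL << 30)  /* 1 GiB per buffer; far bigger than L3, so we hit DIMMs */
#define REPS   8

static double copy_gbps(unsigned char *dst, unsigned char *src, int nthr) {
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        /* Split the buffers into one contiguous chunk per thread. */
        #pragma omp parallel for num_threads(nthr) schedule(static)
        for (long i = 0; i < (long)nthr; i++) {
            size_t chunk = NBYTES / (size_t)nthr;
            memcpy(dst + (size_t)i * chunk, src + (size_t)i * chunk, chunk);
        }
    }
    double dt = omp_get_wtime() - t0;
    return 2.0 * REPS * NBYTES / dt / 1e9;  /* 2x: read + write traffic */
}

int main(void) {
    unsigned char *src = malloc(NBYTES), *dst = malloc(NBYTES);
    if (!src || !dst) return 1;
    memset(src, 1, NBYTES);  /* fault pages in before timing; fine on */
    memset(dst, 2, NBYTES);  /* a non-NUMA part as discussed above    */
    for (int nthr = 1; nthr <= omp_get_max_threads(); nthr *= 2)
        printf("%2d thread(s): %6.1f GB/s\n", nthr, copy_gbps(dst, src, nthr));
    free(src); free(dst);
    return 0;
}
```

On parts like those described above, the 1-thread row typically plateaus well below the multi-thread rows, which is the roughly 3x copy gap I mention.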
I am unaware of the circuit cost/heat dissipation/etc. trade-offs that lead CPU engineers to provide so few "lanes" from each core to the DIMMs (I would love to see slides/papers on this if anyone knows of them). Usually "wider" data paths are not that much of a burden. In my view this property induces a great deal of otherwise unnecessary thread-level parallelism, with all its attendant software complexities (hence "sad").