Bob La Quey wrote:

When last I looked into this, and it has been at
least a decade, RISC CPUs piped large numbers of
instructions through to effectively emulate what
a CISC was doing. In addition, internal clock rates
were often higher than external ones, so the
mismatch was worse than the external clock-to-memory
cycle alone would indicate.

I think you two are talking at cross purposes here. If I read you both correctly, one of you is talking about "code compression" and making sure that the "instruction bandwidth" is fed. The other is talking about the ability of a tight code loop to blast "data bandwidth". Modern processors rarely have much of an instruction bandwidth problem. The bigger problem is waiting for computations to complete and for values to move back and forth from memory. Data bandwidth is the much harder problem (although compression codes *will* cause instruction bandwidth problems, and pointer-chasing systems can stress your branch predictors).

There are two problems in the modern memory-vs-CPU mismatch.

The first is DRAM itself. DRAM goes for high density over performance every time. This is a function of history, and you simply cannot get any memory manufacturer to take the risk of trying a faster-initial-access DRAM instead of a denser one. Thus, we are still at 10's of nanoseconds for initial access in spite of huge bandwidth and process improvements.

2/3 of a modern microprocessor is there to cover up this tradeoff in DRAM. That makes sense because your final repository is a disk drive, which is milliseconds in latency and limited to roughly 10's of megabytes/sec in transfer. Sure, there are codes that are memory bound, but DRAM gets a pass because almost everything we use in a modern system is so much slower than RAM that it doesn't matter. Networks are 10's of milliseconds in latency with bandwidths of low 10's of megabytes per second. Disks are milliseconds in latency with bandwidths of high 10's of megabytes per second. DRAM is so much faster that it works fine as an extra cache layer.

Flash upsets this balance. Once your latency to nonvolatile storage is in the 10's of microseconds range and your bandwidth is in the 100's of megabytes per second, you start wondering what DRAM buys you. For now, it's write performance (flash just *sucks* at random writes). However, people will start designing around that limitation, given how much battery life you could save by removing hard drives from laptop systems. Besides, we already batch up writes to disks.
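Putting rough numbers on those tiers makes the argument concrete. This is just a sketch: the figures are the orders of magnitude quoted above, not benchmarks, and the DRAM bandwidth entry is my own ballpark assumption since the post doesn't give one.

```python
# Order-of-magnitude latency/bandwidth tiers from the discussion above.
# All numbers are illustrative placeholders, not measurements.
tiers = {
    # name: (latency_seconds, bandwidth_bytes_per_sec)
    "network": (20e-3,  20e6),    # 10's of ms, low 10's of MB/s
    "disk":    (5e-3,   80e6),    # ms, high 10's of MB/s
    "flash":   (50e-6, 300e6),    # 10's of us, 100's of MB/s
    "dram":    (50e-9,   3e9),    # 10's of ns; bandwidth is a guess
}

dram_lat = tiers["dram"][0]
for name, (lat, bw) in tiers.items():
    print(f"{name:8s} latency {lat * 1e9:>12,.0f} ns "
          f"= {lat / dram_lat:>9,.0f}x DRAM")
```

The point falls out of the ratios: disk sits around 100,000x DRAM latency, so DRAM is an obvious cache layer in front of it, while flash is only around 1,000x, which is what starts eroding DRAM's role.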

The second part of the problem is that the trace length between the processor and the memory is somewhere in the range of 15-30 cm. That is on the order of 1/2 to 1 ns one way even at the speed of light (and signals propagate slower than that in copper traces). Even if you had a RAM latency of 0, sending the command and getting the response back is limited to about 1 ns. That means we have roughly a 2-3 to 1 CPU-to-memory mismatch that can never be fixed other than by parking memory right on the CPU.
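That back-of-the-envelope physics is easy to check directly. The 2/3-of-c signal speed for a PCB trace below is my assumption (a common rule of thumb); the post itself uses c as the lower bound.

```python
C = 3.0e8          # speed of light in vacuum, m/s
V = 0.66 * C       # rough signal speed in a copper PCB trace (assumption)

for trace_m in (0.15, 0.30):                  # 15 cm and 30 cm, as above
    one_way_ns    = trace_m / C * 1e9         # hard lower bound at c
    round_trip_ns = 2 * trace_m / V * 1e9     # command out + data back
    print(f"{trace_m * 100:.0f} cm trace: one-way >= {one_way_ns:.2f} ns, "
          f"round trip at 0.66c ~= {round_trip_ns:.2f} ns")
```

Against a CPU cycle of a nanosecond or less, a round trip of a couple of nanoseconds is exactly the 2-3:1 floor described above, no matter how fast the DRAM array itself gets.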

-a

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
