Bob La Quey wrote:
When I last looked into this (and it has been at
least a decade), RISC CPUs pipelined large numbers of
instructions to effectively emulate what a CISC was
doing. In addition, internal clock rates were often
higher than external ones, so the mismatch was worse
than the external clock-to-memory cycle alone would
indicate.
I think you two are talking at cross purposes here. If I read you both
correctly, one of you is talking about "code compression" and making
sure that the "instruction bandwidth" is fed; the other is talking about
the ability of a tight code loop to blast "data bandwidth". Modern
processors really don't have much of an instruction bandwidth problem.
The bigger problem is waiting for computations to complete and for
values to move back and forth from memory. Data bandwidth is the much
harder problem (although compression codes *will* cause instruction
bandwidth problems, and pointer-chasing systems can stress your branch
predictors).
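To make that concrete, here's a quick C sketch (the array size and the
constant in the chain are made up) contrasting the two access patterns.
The first loop's loads are independent, so it's limited by data
bandwidth; the second loop's loads each depend on the previous one, so
it pays a full memory round trip per iteration:

#include <stdio.h>

#define N (1 << 20)   /* 1M elements; size is arbitrary */

int main(void)
{
    /* Bandwidth-bound: independent loads the prefetcher can stream. */
    static long a[N];
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += a[i];

    /* Latency-bound: each load's address comes from the previous load,
     * so the core stalls for a full memory round trip per step. */
    static size_t next[N];
    for (size_t i = 0; i < N; i++)
        next[i] = (i * 31337 + 1) % N;  /* odd multiplier: a permutation */
    size_t p = 0;
    for (size_t i = 0; i < N; i++)
        p = next[p];

    /* Print the results so the compiler can't delete the loops. */
    printf("sum=%ld p=%zu\n", sum, p);
    return 0;
}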
There are two problems in the modern memory-vs.-CPU mismatch.
The first is DRAM itself. DRAM goes for high density over performance
every time. This is a function of history: you simply cannot get any
memory manufacturer to take the risk of building a DRAM with faster
initial access instead of higher density. Thus, we are still at tens of
nanoseconds for initial access in spite of huge bandwidth and process
improvements.
Two thirds of a modern microprocessor is there to cover up this tradeoff
in DRAM. That makes sense, because your final repository is a disk
drive, which is milliseconds in latency and tops out around the low tens
of megabytes per second in transfer. Sure, there are codes that are
memory bound, but DRAM gets a pass because almost everything we use in a
modern system is so much slower than RAM that it doesn't matter.
Networks are tens of milliseconds in latency with bandwidths in the low
tens of megabytes per second. Disks are milliseconds in latency with
bandwidths in the high tens of megabytes per second. DRAM is so much
faster that it works fine as an extra cache layer.
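Just to put rough numbers on that (assumed figures: a 3 GHz core and the
round latencies above, not measurements):

#include <stdio.h>

int main(void)
{
    const double clock_hz = 3e9;    /* assumed 3 GHz core */
    const struct { const char *name; double latency_s; } tiers[] = {
        { "DRAM access", 50e-9 },   /* tens of nanoseconds  */
        { "disk seek",   5e-3  },   /* milliseconds         */
        { "network RTT", 30e-3 },   /* tens of milliseconds */
    };

    for (size_t i = 0; i < sizeof tiers / sizeof tiers[0]; i++)
        printf("%-12s %12.0f ns = %12.0f CPU cycles\n",
               tiers[i].name,
               tiers[i].latency_s * 1e9,
               tiers[i].latency_s * clock_hz);
    return 0;
}

That's roughly 150 cycles for DRAM versus 15 million for a disk seek:
five orders of magnitude, which is why DRAM gets its pass.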
Flash upsets this balance. Once your latency to nonvolatile storage is
in the tens of microseconds and its bandwidth is in the hundreds of
megabytes per second, you start wondering what DRAM buys you. For now,
it's write performance (flash just *sucks* at random writes). However,
people will start designing around that limitation, given how much
battery life you could save by removing hard drives from laptop systems.
Besides, we batch up writes to disks already.
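That batching is easy to sketch. A minimal version (hypothetical buffer
size, plain stdio, individual writes assumed smaller than the buffer)
accumulates small writes in RAM and issues them as one large sequential
write, which is exactly the pattern flash handles well:

#include <stdio.h>
#include <string.h>

#define BUF_SIZE (1 << 20)          /* 1 MB batch; size is an assumption */

static char buf[BUF_SIZE];
static size_t used;

static void flush_batch(FILE *f)
{
    if (used) {
        fwrite(buf, 1, used, f);    /* one big sequential write */
        used = 0;
    }
}

static void batched_write(FILE *f, const void *data, size_t len)
{
    if (used + len > BUF_SIZE)      /* assumes len <= BUF_SIZE */
        flush_batch(f);
    memcpy(buf + used, data, len);  /* small writes land in RAM first */
    used += len;
}

int main(void)
{
    FILE *f = fopen("out.dat", "wb");
    if (!f)
        return 1;
    for (int i = 0; i < 100000; i++)
        batched_write(f, &i, sizeof i);
    flush_batch(f);
    fclose(f);
    return 0;
}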
The second part of the problem is that the trace length between the
processor and the memory is somewhere in the range of 15-30 cm. That is
on the order of 1/2 to 1 ns one way even at the speed of light (and
signals propagate slower than that in copper). Even if you had a RAM
latency of 0, sending the command and getting the response back is
limited to about 1 ns. That means we have roughly a 2-3 to 1
CPU-to-memory mismatch that can never be fixed other than by parking
the memory right on the CPU.
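The arithmetic, for what it's worth (assuming a 3 GHz core; real board
traces propagate signals at only 0.5-0.7c, so these are lower bounds):

#include <stdio.h>

int main(void)
{
    const double c = 3e8;           /* speed of light, m/s */
    const double clock_hz = 3e9;    /* assumed 3 GHz core */
    const double traces_m[] = { 0.15, 0.30 };   /* 15 and 30 cm */

    for (int i = 0; i < 2; i++) {
        /* Round trip: command out to the DRAM, data back. */
        double rtt_s = 2 * traces_m[i] / c;
        printf("%2.0f cm trace: %.1f ns round trip = %.1f cycles at 3 GHz\n",
               traces_m[i] * 100, rtt_s * 1e9, rtt_s * clock_hz);
    }
    return 0;
}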
-a