Peter Graf makes some magical things to make me read
} Marcel wrote:
}
} >>>...haven't tried bogomips yet
} >> And my 80 MHz Q60 (GCC-compiled): 101010 Dhrystone/s
} >
} >Wow.
} >
} >> BTW the effective speed of the x86 chips seems to rise slower than their
} >> clock frequency so it'll probably take much more than a 3.1 GHz x86 chip.
} >
} >Not true for QPC. The speed increase from a P75 to an Athlon 1000 was
} >always almost linear.
}
} So we have different experience with Windows PCs. With those I tried, the
} speed increase was significantly less than linear for QPC as well.
}
PC users, beware: the clock frequency of the processor is mostly marketing hype:
a P3 has a pipeline of up to 12 stages to process one instruction,
and the P4 goes to 20!
So, if the program's instructions cannot be executed in the pipeline
(because instruction n+1 needs the result of instruction n), performance really drops!
Moreover, even with a 64-, 128- or 256-bit memory access, the speed at which
memory can be read and written is still limited to either 66, 100 or 133 MHz.
Given the awkward scheme used to access DRAM (first send the row address,
then the column address, and finally get/put the data), accessing memory
very randomly means that with a 66 MHz bus you can be limited to
reading at 22 MHz (it may even be worse, as some memory has a latency
such that it takes two or three bus cycles to get the data). Hence
the idea behind Rambus, where the memory gets a queue of requests and
can thus take a pipeline-like approach: the execution of ONE request
may take 60 cycles, but the next request can be issued before the current one is
finished, thus overlapping the time needed for the execution of TWO requests.
One last thing: some processors have two ALUs instead of the single one
in the old model. That somehow makes up for things and keeps the scaling
roughly linear (sometimes it even performs better), but when you know the
internals, there is really a huge waste compared to what you would have
expected.
P.S.: SSE and other such instructions are so interesting on the PC because they
operate over multiple contiguous bytes/words explicitly, thus allowing
somewhat optimized access to memory: instead of getting
x[0], then x[1], then x[2] (where x is an array of 32-bit integers)
with three consecutive instructions, on a 128-bit-wide bus there is
only one instruction which makes one memory access instead of three.