>c:=a+b
>
>RISC: add a,b,c
>
>CISC: mov a,c (1)
> add b,c (2)
>
>When (2) depends on (1), there is no possibility for parallelization.
>Granted, you can argue that instructions with 3 operands are better than
>ones with 2 operands.
Or look at a slight variation on the theme...
C code:
A[i] += B[i]; // int *A, *B, i;
On Intel, you could...
mov ebx, [i]
mov esi, [A]
mov edi, [B]
; assume ebx,esi,edi loaded as above
mov eax,[esi+4*ebx]
add [edi+4*ebx],eax
while in RISC it would be something like (I don't know MIPS specifically, but
this should be close enough...)
;
; [using destination, source1, source2 syntax]
ld r2, I
ld r3, A
ld r4, B
; assume r2,r3,r4 loaded as above
; r5,r6,r7 as temps
shl r5,r2,2
add r6,r3,r5
add r7,r4,r5
ld r0,@r6
ld r1,@r7 ; *
add r0,r0,r1
sto r0,@r6
since the 3-operand instructions of the RISC engines are strictly
register-to-register.
Re: the comment about sequential instruction blocking: you can interleave
independent operations between the blocked lines.
Re: another comment made about compilers versus hand-tweaked assembler: the
nature of the Fast Fourier Transforms performed by the Lucas-Lehmer tests is
such that large amounts of complex, indexed memory must be fetched repeatedly.
While register optimization IS one important component, cache-miss scheduling
and timing is a far murkier area where great performance boosts can only be
achieved by very carefully 'timing' the position of the loads in the overall
instruction stream. For instance, at the point marked "*" in the RISC code
above, you'd probably want to put about 12 OTHER instructions that did
something completely different to allow the pipelined cache misses to
complete.
Now, sure, those fancy addressing modes take several clocks to execute on the
Pentium. The add to memory takes an extra clock or three as it has to do a
read-modify-write of memory (which however is heavily pipelined).
Anyway, all I'm really proving here is that the CISC vs. RISC wars rage on.
Things are going to be even more fun when Intel's next generation
architecture, the IA-64, hits the scene. If I understand correctly, each
'instruction' encoded expressly controls several parallel execution units in a
VLIW fashion. Now, I once attempted hand programming of an engine like this
(one of the so-called media processors that didn't do so well in the market)
and discovered that achieving the architecture's theoretical several
giga-ops/sec was nearly impossible with hand-crafted programming.
-jrp