On the discussion about GPUs for OGP:

Here is an off-the-wall idea a friend and I were talking about a while
back. I think it would be of great use in a GPU. The idea is to
create a "modularized" RISC processor. Here is a basic rundown:

The CPU accepts 3 instructions:
load
store
mov

All operations are actually modules that expose one or more registers.
An add module would look like this:

addin (in)
addins (in)
addout (out)

Every clock cycle the contents of addin and addins are added and the
result is stored into addout. Now this doesn't sound all that good,
until you realize that in current processors the pipeline must
sit idle while a result is being computed. When you get to
multiplication this can be 3-4 cycles! So let's say we need to
do four multiplications. On a normal processor it would look like this:

command               (cycles)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result1     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result2     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result3     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result4     (1)

Total clock cycles: 28
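
As a quick check of that total, here is a small sketch of the arithmetic. It assumes the latencies shown in the listing (1 cycle per mov, 4 per mul) and that nothing overlaps; the constant names are mine, for illustration only.

```python
# Assumed latencies, taken from the cycle counts in the listing above.
MOV_CYCLES = 1
MUL_CYCLES = 4

# One multiplication = two operand moves, the multiply itself, one result move.
cycles_per_multiply = 2 * MOV_CYCLES + MUL_CYCLES + MOV_CYCLES
total_cycles = 4 * cycles_per_multiply

print(cycles_per_multiply, total_cycles)  # 7 28
```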

Modularized RISC method:

command            (1 cycle each)

mov 1, mulin1
mov 2, mulin2
mov 1, mulin1
mov 2, mulin2
mov 1, mulin1
#At this point the result from line 2 is ready so we move it out
mov mulout, result1
mov 2, mulin2
#And now we can move the result from line 4
mov mulout, result2
mov 1, mulin1
mov 2, mulin2
#Result from line 7
mov mulout, result3
mov 0, sink #Wait a cycle
mov 0, sink #Wait a cycle
mov 0, sink #Wait a cycle
mov mulout, result4

Total clock cycles: 15
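
To see whether a schedule like this can actually deliver the right products, here is a rough Python sketch of the multiply module. It rests on several assumptions of mine that the idea above does not pin down: the module multiplies whatever mulin1 and mulin2 hold at the end of every cycle, a product becomes readable in mulout 4 cycles after its inputs were written, and one mov executes per cycle. The function name and the operand values are made up; I used distinct operands instead of 1 and 2 so each result can be told apart.

```python
from collections import deque

MUL_LATENCY = 4  # assumed: a product is readable 4 cycles after its inputs are written


def run(program):
    """Run one mov per cycle and return (registers, cycles used)."""
    regs = {"mulin1": 0, "mulin2": 0, "mulout": 0}
    # Products still in flight; after each cycle the oldest one is latched into mulout.
    in_flight = deque([0] * MUL_LATENCY, maxlen=MUL_LATENCY)
    cycles = 0
    for src, dst in program:
        # mov: the source is either an immediate (int) or a register name.
        regs[dst] = src if isinstance(src, int) else regs[src]
        # The module multiplies whatever its inputs hold at the end of every cycle.
        in_flight.append(regs["mulin1"] * regs["mulin2"])
        regs["mulout"] = in_flight[0]
        cycles += 1
    return regs, cycles


WAIT = (0, "sink")  # "mov 0, sink": burn one cycle
program = [
    (3, "mulin1"), (5, "mulin2"),
    (7, "mulin1"), (11, "mulin2"),
    (2, "mulin1"),
    ("mulout", "result1"),
    (9, "mulin2"),
    ("mulout", "result2"),
    (4, "mulin1"), (6, "mulin2"),
    ("mulout", "result3"),
    WAIT, WAIT, WAIT,
    ("mulout", "result4"),
]

regs, cycles = run(program)
results = [regs[f"result{i}"] for i in range(1, 5)]
print(results, cycles)  # [15, 77, 18, 24] 15
```

Under these assumptions the ordering of the movs matters a great deal: a read of mulout one cycle early or late picks up a product of mismatched inputs, so the compiler (or programmer) has to know the module latency exactly.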

Now let's say that we could execute two move instructions at a time.
Then the code would look like this:

mov 1, mulin1 : mov 2, mulin2
mov 1, mulin1 : mov 2, mulin2
mov 1, mulin1 : mov 2, mulin2
mov 1, mulin1 : mov 2, mulin2
mov mulout, result1
mov mulout, result2
mov mulout, result3
mov mulout, result4

Total clock cycles: 8
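
The two-moves-per-cycle case can be sketched the same way. Again this is only a model under my own assumptions: a 4-cycle multiply that samples its inputs at the end of every cycle, at most two movs per cycle, and all sources in a cycle read before any destination is written. The names run_bundles and ISSUE_WIDTH and the operand values are hypothetical.

```python
from collections import deque

MUL_LATENCY = 4  # assumed multiplier latency in cycles
ISSUE_WIDTH = 2  # assumed: up to two movs execute per cycle


def run_bundles(bundles):
    """Run one bundle of movs per cycle and return (registers, cycles used)."""
    regs = {"mulin1": 0, "mulin2": 0, "mulout": 0}
    in_flight = deque([0] * MUL_LATENCY, maxlen=MUL_LATENCY)
    for bundle in bundles:
        if len(bundle) > ISSUE_WIDTH:
            raise ValueError("bundle exceeds issue width")
        # Read every source first so movs in one bundle cannot see each other's writes.
        values = [src if isinstance(src, int) else regs[src] for src, _ in bundle]
        for (_, dst), value in zip(bundle, values):
            regs[dst] = value
        # The multiplier samples its inputs once per cycle, after all writes.
        in_flight.append(regs["mulin1"] * regs["mulin2"])
        regs["mulout"] = in_flight[0]
    return regs, len(bundles)


operands = [(3, 5), (7, 11), (2, 9), (4, 6)]
bundles = [[(a, "mulin1"), (b, "mulin2")] for a, b in operands]
bundles += [[("mulout", f"result{i}")] for i in range(1, 5)]

regs, cycles = run_bundles(bundles)
results = [regs[f"result{i}"] for i in range(1, 5)]
print(results, cycles)  # [15, 77, 18, 24] 8
```

With both inputs written in the same cycle, a new multiplication starts every cycle and the results come out back to back, which is where the 8-cycle figure comes from.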

Some of this code probably has errors (I can even see some just
looking at it now), but you get the point. Granted, I know little about
CPU design, but I see this sort of CPU being very powerful. You get
the speed of a vector processor with the simplicity of a RISC design.
A full-blown CPU like this would have anywhere from 4 to 8 moves per
clock cycle. Multiple calculation units could be included as well.
Because these "registers" would live in a register file, adding modules
would be both trivial and highly useful. Some would even be backwards
compatible.

I see this as an Itanium (i.e. the compiler does most of the
optimizations) in a RISC package.

Thoughts?

Timothy