On the discussion about GPUs for OGP: here is an off-the-wall idea a friend and I were talking about a while back. I think it would be of great use in a GPU. The idea is to create a "modularized" RISC processor. Here is a basic rundown:
The CPU accepts 3 instructions:

    load
    store
    mov

All operations are actually modules that expose one or more registers. An add module would look like this:

    addin  (in)
    addins (in)
    addout (out)

Every clock cycle the contents of addin and addins are added and the result is stored into addout.

Now this doesn't sound all that good, except when you realize that in current processors the pipeline must sit idle while a result is being computed. When you get to multiplication this can be 3-4 cycles! So let's say we need to do four multiplications. On a normal processor it would look like this:

    command             (cycles)
    mov 1, reg1         (1)
    mov 2, reg2         (1)
    mul reg1, reg2      (4)
    mov reg3, result1   (1)
    mov 1, reg1         (1)
    mov 2, reg2         (1)
    mul reg1, reg2      (4)
    mov reg3, result2   (1)
    mov 1, reg1         (1)
    mov 2, reg2         (1)
    mul reg1, reg2      (4)
    mov reg3, result3   (1)
    mov 1, reg1         (1)
    mov 2, reg2         (1)
    mul reg1, reg2      (4)
    mov reg3, result4   (1)

    Total clock cycles: 28

Modularized RISC method (every command takes 1 cycle):

    mov 1, mulin1
    mov 2, mulin2
    mov 1, mulin1
    mov 2, mulin2
    mov 1, mulin1
    #At this point the result from line 2 is ready so we move it out
    mov mulout, result1
    mov 2, mulin2
    #And now we can move the result from line 4
    mov mulout, result2
    mov 1, mulin1
    mov 2, mulin2
    #Result from line 7
    mov mulout, result3
    mov 0, sink    #Wait a cycle
    mov 0, sink    #Wait a cycle
    mov 0, sink    #Wait a cycle
    mov mulout, result4

    Total clock cycles: 15

Now let's say that we could execute two move instructions at a time. Then the code would look like this:

    mov 1, mulin1 : mov 2, mulin2
    mov 1, mulin1 : mov 2, mulin2
    mov 1, mulin1 : mov 2, mulin2
    mov 1, mulin1 : mov 2, mulin2
    mov mulout, result1
    mov mulout, result2
    mov mulout, result3
    mov mulout, result4

    Total clock cycles: 8

Some of this code probably has errors (I can even see some just looking at it now), but you get the point. Granted, I know little about CPU design, but I see this sort of CPU being very powerful. You get the speed of a vector processor with the simplicity of a RISC design.
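The cycle counts above can be sanity-checked with a small simulation. What follows is only a rough sketch in Python, not a hardware description. It assumes a fully pipelined multiplier with a 4-cycle latency, that all moves issued in one cycle read their sources before any destination is written, and that the module latches mulin1 * mulin2 at the end of every cycle. The names MulModule, run, conventional_cycles and the two schedule helpers are made up for illustration, and the operand pairs are changed from 1 and 2 to distinct values so each result can be told apart.

```python
from collections import deque

MUL_LATENCY = 4  # assumed multiplier latency in cycles (the post says 3-4)
PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8)]  # illustrative operands, not from the post


class MulModule:
    """Hypothetical multiply module exposing mulin1, mulin2 and mulout."""

    def __init__(self, latency=MUL_LATENCY):
        self.mulin1 = 0
        self.mulin2 = 0
        self.mulout = 0
        # Products still in flight; the oldest entry reaches mulout next.
        self._pipe = deque([0] * (latency - 1))

    def tick(self):
        """End of a clock cycle: latch a new product, retire the oldest one."""
        self._pipe.append(self.mulin1 * self.mulin2)
        self.mulout = self._pipe.popleft()


def run(schedule, latency=MUL_LATENCY):
    """Execute a schedule: one list of (source, destination) moves per cycle."""
    mul = MulModule(latency)
    results = {}
    for moves in schedule:
        # Read every source first so all moves in a cycle see the same state.
        values = [mul.mulout if src == "mulout" else src for src, _ in moves]
        for (_, dst), value in zip(moves, values):
            if dst in ("mulin1", "mulin2"):
                setattr(mul, dst, value)
            else:
                results[dst] = value  # result registers and the sink
        mul.tick()
    return results, len(schedule)


def conventional_cycles(n_mults, latency=MUL_LATENCY):
    """Cycle count for the stalling listing: two loads, one mul, one store."""
    return n_mults * (1 + 1 + latency + 1)


def dual_move_schedule():
    """Two moves per cycle: four loads, then four reads of mulout.

    This lines up only because the four loads exactly fill the assumed
    4-cycle pipeline.
    """
    loads = [[(a, "mulin1"), (b, "mulin2")] for a, b in PAIRS]
    reads = [[("mulout", f"result{i}")] for i in range(1, len(PAIRS) + 1)]
    return loads + reads


def single_move_schedule():
    """One move per cycle, in the same order as the 15-cycle listing."""
    (a1, b1), (a2, b2), (a3, b3), (a4, b4) = PAIRS
    wait = (0, "sink")
    moves = [
        (a1, "mulin1"), (b1, "mulin2"),
        (a2, "mulin1"), (b2, "mulin2"),
        (a3, "mulin1"), ("mulout", "result1"),
        (b3, "mulin2"), ("mulout", "result2"),
        (a4, "mulin1"), (b4, "mulin2"),
        ("mulout", "result3"),
        wait, wait, wait,
        ("mulout", "result4"),
    ]
    return [[move] for move in moves]


if __name__ == "__main__":
    print("conventional:", conventional_cycles(len(PAIRS)), "cycles")
    for name, schedule in [("single move", single_move_schedule()),
                           ("dual move", dual_move_schedule())]:
        results, cycles = run(schedule)
        print(name, "->", cycles, "cycles", results)
```

Under these assumptions both move schedules read back the correct four products, in 15 and 8 cycles, against 28 for the stalling version. With a different latency or read timing the moves would have to be rescheduled, which is the work the compiler would take on.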
A full-blown CPU like this would have anywhere from 4 to 8 moves per clock cycle. Multiple calculation units could be included as well. Because these "registers" would live in a register file, adding modules would be both trivial and highly useful. Some would even be backwards compatible. I see this as an Itanium (i.e. the compiler does most of the optimizations) in a RISC package.

Thoughts?

Timothy
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
