Nicolas Boulay wrote:
> A year or two ago, somebody posted a lot of real-world shader code. Despite
> the fact that the OpenGL ARB proposes vector operations, most of the
> instructions used are scalar. So a SIMD processor has no interest
> for this kind of code.

I guess it depends on what you mean by scalar code. IIUC, this is our prototype code:

https://svn.suug.ch/repos/opengraphics/main/trunk/new_model/ogmodel.cpp

Most of the arithmetic in this code is vector and matrix work. It is written as scalar, but it should be obvious that it is vector and matrix code which has been 'unrolled' into the individual scalar operations. So it could also be stated as matrix operations and run on a SIMD vector processor.

> Maybe you have heard that ATI and nVidia are switching to "scalar"
> cores. Maybe you know why now. Before, rumors said that they used
> 2-way SIMD cores.

All arithmetic operations are performed by scalar arithmetic blocks; the difference is how these blocks are organized. The most basic combination is a multiplier and an adder combined into a multiply-accumulate (MAC) block. Then we can put 4 of these MACs side by side with common control logic and we have a 4-word SIMD vector processor. It does the same work as 4 independent MACs; only the control structure is different. IIUC, ATI and nVidia both use a configurable array of general-purpose arithmetic units; this is why they are useful as supercomputers.

> If you use one CPU core with a 4-way SIMD engine, or 4 scalar CPUs,
> you will need the same data bandwidth to fill all the units. The
> difference is that the SIMD core will be less efficient for advanced
> shader code.

It depends on what you mean by less efficient! Do you just mean slower? Are you saying that a 4-word SIMD arithmetic unit will be less efficient at executing vector and matrix operations? Yes, you can make a faster processor to do this, but it will take MORE hardware. It can only be made faster by adding more hardware multiply arrays (these are the major expense in chip real estate).

> The most used instruction is "add", then "mul". For maximum
> efficiency (MIPS per mm² of silicon), the core must sustain one "mul"
> per cycle, and why not 2 adds.

Actually, the most common operation for this type of code is the MAC. If you refer to the code [mentioned above] you will see that most of the arithmetic statements are of the form x = y1*a1 + y2*a2 + ... + yn*an. Such code is most efficiently executed with a MAC instruction, which eliminates the pipeline dependency on the multiply. It can also be completely decomposed and run on a systolic array with one hardware multiplier per multiply. That will clearly be faster, but it will require a lot more hardware.

I hope that this is now clear. IIUC, your argument seems to be that you can somehow do things faster, without adding hardware, by decomposing the vector operations into scalar operations. This is clearly wrong. You are leaving out the fact that doing so requires more hardware to run the problem, and more hardware is more hardware, so of course it will run the problem faster.

Exactly how to organize the hardware is a question I don't have a definite answer to. However, the organization needs to be based on the algorithm that will run on it. Clearly, the prototype code is vector code, even though it has been unrolled into scalar. You unroll code when you have a parallel processor to run it on; otherwise, there is nothing to gain.

--
JRT

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)