Nicolas Boulay wrote:
> A year or two ago, somebody posted a lot of real-world shader code. Despite
> the fact that the OpenGL ARB proposes vector operations, most of the
> instructions used are scalar. So a SIMD processor has no interest
> for this kind of code.

I guess it depends on what you mean by scalar code. IIUC, this is our prototype code:

https://svn.suug.ch/repos/opengraphics/main/trunk/new_model/ogmodel.cpp

Most of the arithmetic in this code is vector and matrix work. It is written as scalar, but it should be obvious that it is vector and matrix code which has been 'unrolled' into the individual scalar operations. So it could also be stated as matrix operations and run on a SIMD vector processor.

> Maybe you have heard that ATI and nVidia are switching to "scalar"
> cores. Maybe you know why now. Before, rumors said that they used
> 2-way SIMD cores.

All arithmetic operations are performed by scalar arithmetic blocks; the difference is how these blocks are organized. The most basic combination is a multiplier and an adder combined into a multiply-accumulate (MAC) block. Then we can put 4 of these MACs side by side with common control logic and we have a 4-word SIMD vector processor. It does the same work as 4 independent MACs; only the control structure is different. IIUC, ATI and nVidia both use a configurable array of general-purpose arithmetic units; this is why they are useful as supercomputers.

> If you use one CPU core with a 4-way SIMD engine, or 4 scalar CPUs,
> you will need the same data bandwidth to fill all the units. The
> difference is that the SIMD core will be less efficient for advanced
> shader code.

It depends on what you mean by less efficient! Do you just mean slower? Are you saying that a 4-word SIMD arithmetic unit will be less efficient at executing vector and matrix operations? Yes, you can make a faster processor to do this, but it will take MORE hardware. It can only be made faster by adding more hardware multiply arrays (these are the major expense in chip real estate).

> The most used instruction is "add", then "mul". For maximum
> efficiency (MIPS per mm² of silicon), the core must sustain one "mul"
> per cycle, and why not 2 adds.

Actually, the most common operation for this type of code is the MAC. If you refer to the code [mentioned above] you will see that most of the arithmetic statements are of the form x = y1*a1 + y2*a2 + ... + yn*an. Such code is most efficiently executed with a MAC instruction, which eliminates the pipeline dependency on the multiply. It can also be completely decomposed and run on a systolic array with one hardware multiplier per multiply. That will clearly be faster, but it will require a lot more hardware.

I hope that this is now clear. IIUC, your argument seems to be that you can somehow do things faster, without adding hardware, by decomposing the vector operations into scalar operations. This is clearly wrong. You are leaving out the fact that doing so requires more hardware to run the problem, and more hardware is more hardware, so of course it will run the problem faster.

Exactly how to organize the hardware is a question I don't have a definite answer to. However, the organization needs to be based on the algorithm that will run on it. Clearly, the prototype code is vector code, even though it has been unrolled into scalar. You unroll code when you have a parallel processor to run it on; otherwise, there is nothing to gain.

--
JRT

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)