Timothy Normand Miller wrote:
> On Mon, Sep 21, 2009 at 12:17 AM, Hugh Fisher <[email protected]> wrote:
>   
>> Timothy Normand Miller wrote:
>>     
>>> One of the design details that seems to be hard to present is the MIMD
>>> architecture.  At first glance, it looks like a SIMD architecture.
>>> But all of you are right to point out that shader workloads are
>>> primarily scalar.
>>>       
>> I'd like to see some evidence for this.
>>     
>
> A challenge!  :)  People keep telling me that they're primarily scalar
> workloads.  I have accepted what they say.  It may be that
> well-written shader programs are heavily vector but that typical
> shader programs written by typical programmers are not.
>
> Besides the obvious scalar ALU instructions, there are other
> instructions that take bandwidth that are also not vector:  flow
> control, loads and stores
> There's lots of those.  No?
>
> Also, if memory load instruction latency dominates, then none of this
> matters.  Many shader programs will spend most of their time waiting
> on memory, making vector optimizations moot.
>   
For the memory accesses, we are still considering how to do them. Since
each ALU targets a different thread, one possibility is to allow only
indirect memory access, using a register as a data pointer. It's not
that efficient, but it greatly simplifies how each thread accesses
memory.

Also, don't forget that a good chunk of memory accesses would normally be
for constants. There's going to be a constant register file accessible by
the ALUs in read-only mode. That would reduce the memory bandwidth
requirement and free up more working registers for each ALU.

>   
>> Some years ago I wrote a bunch of demonstration GPU shader programs
>> in low level ARB/nVidia assembly. You can still find them at:
>> <http://cs.anu.edu.au/~Hugh.Fisher/3dstuff/lowlevel.html>
>>
>> 80% of the instructions are vector, only 20% scalar. The ratio of
>> scalar instructions increases very slightly with the more complex
>> shaders to perhaps 25%. The single most common instruction is DP,
>> Dot Product, of three or four operands from a vertex/color/matrix.
>>     
>
> I can see vertex shader programs being DP heavy.  But there will be
> far fewer vertexes than fragments.  How DP-heavy are fragment shader
> programs, generally?
>
>   
In either case, even if the programmer uses a lot of vector math, all
vector math can be reduced to its scalar components. Most vector
operations are internally a set of small scalar operations, and some
parts must be done serially. It's more practical in terms of chip area
to break them down into scalar operations. The high-level representation
may be a DP, but in a modern GPU that DP will be broken down into scalar
operations to saturate the pipeline. I'm not sure whether they implement
that operation in software or in microcode. Building a true vector ALU
would probably accelerate some operations, but by doing so you reduce
the average utilization of all the ALUs.

If we do a short evaluation of utilization:
From the 20% scalar instructions we use 25% of our ALU lanes.
Assuming an even split of the remaining 80% between 3-operand and
4-operand vector operations, that gives us 40% at 3 operands (75%
utilization) and 40% at 4 operands (100%).
That gives an average utilization of 0.2*0.25 + 0.4*0.75 + 0.4*1.0 = 75%
of the available resources.

That amount isn't so bad, but it still means that for every 4 operations
we do, we waste 1. Also, that 75% assumes we do an operation like a dot
product in 1 pass, which is unlikely considering the hardware
requirements. To be practical, a dot product would probably have to be
done in 3 passes: 1 multiply and 2 adds, each pass with diminishing ALU
activity. For a 4-operand dot product it would fall to: multiply (100%)
+ add of the 2 pairs (50%) + final add (25%) = 58.3% saturation of the
resources.

Those 2 samples are part of the reason why we are targeting a MIMD
design doing scalar operations: we reduce the idle time of each ALU. A
vector operation will take more time to complete, but we can run more
threads and operations without that idle time being present. We did
consider a vector operation processor at one point, but we rejected that
idea after considerable debate. Another option was gathering multiple
threads to optimize ALU use, but that would increase the complexity of
the instruction scheduler, probably too much to be realizable with the
resources we have.
>> If you're using shaders to emulate the original fixed function
>> OpenGL/Direct3D pipelines, the ratio of SIMD to scalar will be
>> even higher.
>>
>> OK, my shaders are old, and predate Shader Model 3.0 and widespread
>> use of high level languages. They still do what every 3D engine
>> spends most of its time doing: multiply vertices by a matrix, and
>> RGB/RGBA colors by other colors.
>>
>> I'm happy to be proved wrong on this, but let's do so on the basis
>> of real world shaders written by graphics programmers.
>>     
>
> Yeah.  I agree.  I don't know enough about this myself.  We need to do
> this right without egos or too much guessing.  Others here should be
> able to fill in the gaps.
>
>   
>> I'd suggest a MIPS with each floating point reg extended to 128 bits
>> as 4 x 32 / 2 x 64 floats with every add/etc instruction now being
>> SIMD. For you Intel folk, think of it as using SSE for everything.
>>     
>
> This is congruent with one of my early designs.  :)
>
>
>   
It's also similar to one of our earlier designs that we considered. An
important fact is that Intel has lots of silicon: some operations are
probably done in part by the SSE instructions and completed by the more
general-purpose part of the CPU. I haven't looked at the math libraries,
but for some of the higher-performing ones I wouldn't be surprised if
that were the case.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
