Timothy Normand Miller wrote:
> On Mon, Sep 21, 2009 at 12:17 AM, Hugh Fisher <[email protected]> wrote:
>
>> Timothy Normand Miller wrote:
>>
>>> One of the design details that seems to be hard to present is the MIMD
>>> architecture. At first glance, it looks like a SIMD architecture.
>>> But all of you are right to point out that shader workloads are
>>> primarily scalar.
>>>
>> I'd like to see some evidence for this.
>>
>
> A challenge! :) People keep telling me that they're primarily scalar
> workloads. I have accepted what they say. It may be that
> well-written shader programs are heavily vector but that typical
> shader programs written by typical programmers are not.
>
> Besides the obvious scalar ALU instructions, there are other
> instructions that take bandwidth that are also not vector: flow
> control, loads and stores. There's lots of those. No?
>
> Also, if memory load instruction latency dominates, then none of this
> matters. Many shader programs will spend most of their time waiting
> on memory, making vector optimizations moot.
>

As for memory accesses, we are still considering how to do them, since
each ALU targets a different thread. One possibility is to allow only
indirect memory access, using a register as a data pointer. It's not
the most efficient scheme, but it greatly simplifies how each thread
accesses memory.
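To illustrate the register-indirect idea, here is a minimal sketch of a per-thread load where the address always comes out of a register. All names and the register layout here are illustrative assumptions, not taken from the actual design:

```python
# Toy model: each ALU runs its own thread with a private register file,
# and the only memory instruction is register-indirect LOAD.
MEM = {0x100: 3.5, 0x104: 7.25}  # toy shared data memory

class Thread:
    """One ALU's thread: private registers, shared data memory."""
    def __init__(self, regs):
        self.regs = regs

    def load_indirect(self, dst, addr_reg):
        # LOAD dst, [addr_reg]: the address is read out of a register,
        # so the instruction stream itself never encodes a per-thread
        # address -- every thread can run the same code.
        self.regs[dst] = MEM[self.regs[addr_reg]]

# Two threads point their address registers at different locations but
# execute the very same instruction.
t0 = Thread([0x100, 0.0])
t1 = Thread([0x104, 0.0])
for t in (t0, t1):
    t.load_indirect(dst=1, addr_reg=0)
```

The point of the sketch is that the scheduler never has to decode per-thread immediates; divergence between threads lives entirely in their register files.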
Also, don't forget that a good chunk of memory accesses would normally
be for constants. There is going to be a constant register file,
accessible by the ALUs in read-only mode. That would reduce the memory
bandwidth requirement and free up more working registers for each ALU.

>
>> Some years ago I wrote a bunch of demonstration GPU shader programs
>> in low level ARB/nVidia assembly. You can still find them at:
>> <http://cs.anu.edu.au/~Hugh.Fisher/3dstuff/lowlevel.html>
>>
>> 80% of the instructions are vector, only 20% scalar. The ratio of
>> scalar instructions increases very slightly with the more complex
>> shaders to perhaps 25%. The single most common instruction is DP,
>> Dot Product, of three or four operands from a vertex/color/matrix.
>>
>
> I can see vertex shader programs being DP heavy. But there will be
> far fewer vertexes than fragments. How DP-heavy are fragment shader
> programs, generally?
>

In either case, even if the programmer uses a lot of vector math, all
vector math can be reduced to its scalar components. Most vector
operations are really collections of small scalar operations, some of
which must be done serially. In terms of chip space, it's more
practical to break them down into scalar operations. The high-level
representation may be a DP, but in a modern GPU that DP will be broken
down into scalar operations to saturate the pipeline. I'm not sure
whether they implement that operation in software or in microcode.

Making a true vector ALU would probably accelerate some operations, but
doing so reduces the average utilization of all the ALUs. A short
evaluation of utilization on a 4-wide vector ALU: the 20% scalar
instructions use 25% of the lanes. Assuming an even split between
3- and 4-operand vector instructions, that gives 40% at 3 operands
(75% of the lanes) and 40% at 4 operands (100%). The average is:

  0.2 * 0.25 + 0.4 * 0.75 + 0.4 * 1.0 = 75%

saturation of the available resources. That isn't so bad, but it still
means that for every 4 operations we do, we waste 1.
Also, that 75% assumes we can do an operation like a dot product in a
single pass, which is unlikely given the hardware it would require. To
be practical, a dot product would probably be done in 3 passes: 1
multiply and 2 adds, with diminishing ALU activity on each pass. For a
4-operand dot product, that falls to:

  (mult: 100% + paired adds: 50% + final add: 25%) / 3 = 58.3%

saturation of the resources.

These two examples are part of the reason why we are targeting a MIMD
design doing scalar operations: it reduces the idle time of each ALU.
A vector operation will take more time to execute, but we can run more
threads and operations without that idle time being present. We did
consider a vector-operation processor at one point, but rejected the
idea after considerable debate. Another possibility was gathering
multiple threads together to fill the vector lanes, but that would
increase the complexity of the instruction scheduler, probably too much
to be realizable with the resources we have.

>> If you're using shaders to emulate the original fixed function
>> OpenGL/Direct3D pipelines, the ratio of SIMD to scalar will be
>> even higher.
>>
>> OK, my shaders are old, and predate Shader Model 3.0 and widespread
>> use of high level languages. They still do what every 3D engine
>> spends most of its time doing: multiply vertices by a matrix, and
>> RGB/RGBA colors by other colors.
>>
>> I'm happy to be proved wrong on this, but let's do so on the basis
>> of real world shaders written by graphics programmers.
>>
>
> Yeah. I agree. I don't know enough about this myself. We need to do
> this right without egos or too much guessing. Others here should be
> able to fill in the gaps.
>
>
>> I'd suggest a MIPS with each floating point reg extended to 128 bits
>> as 4 x 32 / 2 x 64 floats with every add/etc instruction now being
>> SIMD. For you Intel folk, think of it as using SSE for everything.
>>
>
> This is congruent with one of my early designs.
:)

It's also similar to one of our earlier designs that we considered. An
important fact is that Intel has lots of silicon. Probably some
operations are done in part by the SSE instructions and completed by
the more general-purpose part of the CPU. I haven't looked at the math
libraries, but for some of the higher-performing ones, I wouldn't be
surprised if that were the case.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
