2009/9/21 Hugh Fisher <[email protected]>

> Timothy Normand Miller wrote:
>>
>> Besides the obvious scalar ALU instructions, there are other
>> instructions that take bandwidth that are also not vector: flow
>> control, loads and stores. There's lots of those. No?
>
> No. A vertex shader has to multiply the vertex by the current matrix,
> which is four 4-way multiplies and four 4-way adds. No branching.
>
> Standard OpenGL/Direct3D fixed-pipeline lighting has one loop, over
> the available light sources, say two scalar instructions. Each time
> through the loop there's the surface-normal multiply by a 3x3 matrix
> and normalise: thirteen 3-way multiplies/adds and one scalar division.
> The lighting equation is eight (maybe more, depending on LIT) 4-way
> multiplies/adds and one scalar test & branch.
>
> Loads and stores are mostly of matrices (e.g. skinning), or materials
> and colors, which are one or more 3/4-way RGB/RGBA vectors.
>
> Loads from texture maps are also vector ops, either RGB/RGBA vectors
> or surface normals or other 3/4-way floating-point vectors.
>
>> Also, if memory load instruction latency dominates, then none of this
>> matters. Many shader programs will spend most of their time waiting
>> on memory, making vector optimizations moot.
>
> If memory load is important, isn't SIMD faster than fetching and
> executing four scalar instructions in succession?
With the proposed architecture the load would be the same, since one
scalar instruction is issued to at least 4 threads at once. The
architecture runs many kernels (shader programs) at once, across
multiple threads, so while fetching one instruction you're executing
4 threads at once. It may seem counter-intuitive, but the shader unit
is executing a set of m (kernels) * n (threads) at a time: the kernels
(programs) are executed one after another, and the threads are the
dataset to be processed. Since most threads need to execute the same
program, they are processed and controlled by the same kernel. This
reduces the need for data communication between the different ALUs
and cuts down on dead time. One vector operation on a single thread
will take longer than it would on a standard SIMD unit, but when
executing multiple threads controlled by the same kernel, the overall
time per vector operation is reduced.
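To make the fetch-amortisation argument concrete, here is a minimal
software sketch (not the actual hardware design, and the instruction
names and register layout are made up for illustration): each scalar
instruction of a kernel is fetched and decoded once, then executed in
lockstep for every thread in the group, so n threads share one fetch
instead of performing n separate ones.

```python
# Sketch of lockstep kernel execution: one instruction fetch is
# amortised over all threads that run the same kernel.

def run_kernel(program, thread_data):
    """Execute one kernel over all threads, counting instruction fetches."""
    fetches = 0
    for op, *args in program:          # one fetch/decode per instruction...
        fetches += 1
        for regs in thread_data:       # ...executed for every thread in turn
            if op == "mul":
                d, a, b = args
                regs[d] = regs[a] * regs[b]
            elif op == "add":
                d, a, b = args
                regs[d] = regs[a] + regs[b]
    return fetches

# Hypothetical 2-instruction scalar kernel: r2 = r0*r1; r3 = r2+r0
program = [("mul", "r2", "r0", "r1"), ("add", "r3", "r2", "r0")]
# 4 threads, each with its own register file (different r0 = the dataset)
threads = [{"r0": float(i), "r1": 2.0, "r2": 0.0, "r3": 0.0} for i in range(4)]

fetches = run_kernel(program, threads)
print(fetches)           # 2 fetches serve all 4 threads (not 8)
print(threads[3]["r3"])  # thread 3 computed 3.0*2.0 + 3.0 = 9.0
```

A per-thread scalar machine would have fetched 2 instructions x 4
threads = 8 times for the same work; the shared-kernel scheme fetches
each instruction once regardless of the thread count, which is the
point being made above.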
_______________________________________________ Open-graphics mailing list [email protected] http://lists.duskglow.com/mailman/listinfo/open-graphics List service provided by Duskglow Consulting, LLC (www.duskglow.com)
