On Tue, Sep 22, 2009 at 6:23 PM, <[email protected]> wrote:
> Today's Topics:
>
>    1. Re: OGA2 SIMD/MIMD (Hugh Fisher)
>    2. Re: OGA2 SIMD/MIMD (Hugh Fisher)
>    3. Re: OGA2 SIMD/MIMD (Andre Pouliot)
>    4. Re: OGA2 SIMD/MIMD (Andre Pouliot)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 22 Sep 2009 10:59:55 +1000
> From: Hugh Fisher <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: ogml <[email protected]>
>
> Kenneth Ostby wrote:
> > If you read through some of the original G80 architectural whitepapers
> > from Nvidia, you can find that they mention they found enough
> > scalar instructions in shaders to warrant a change from pure vector
> > processors to scalar ones. This, though, is just a feature of the
> > underlying architecture. The programmer will still operate with
> > vector instructions in his language of choice, and the compiler will
> > just translate them into multiple scalar instructions.
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really want to squeeze every last cycle of performance out of those
> transistors, so they are willing to implement massively complicated logic
> to split vector operations across multiple ALUs and fuse the results
> back together afterwards, and to do runtime load balancing of shaders
> across internal threads. They have far more resources for design and
> simulation testing, on top of ten years' experience building these
> chips.
>
> > Furthermore, a scalar processor will increase the throughput of the GPU,
> > which is our main concern. In a GPU, single-thread performance is really
> > not that important compared to the overall performance. Hence, with a
> > scalar design we can fully utilize all of the cores even in the 25% of
> > the code where we have scalar instructions. The other option would be to
> > work some magic trying to fuse threads into vector ops when we
> > see multiple scalar instructions, but this seems overly complex.
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32 bit reads from constant memory/vertex
> data/shader registers instead of one 128 bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?
>
> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suit.
>
> --
>         Hugh Fisher
>         CECS, ANU
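To make the fetch/bandwidth tradeoff Hugh describes concrete, here is a
minimal C sketch (purely illustrative; not OGA2 or nVidia code, and the
function names are invented for the example): the same vec4 add written
as the compiler-expanded scalar sequence and as a single 128-bit SSE
operation.

    #include <xmmintrin.h>   /* SSE intrinsics (gcc, clang, MSVC) */

    /* Scalar expansion: four opcode fetches and eight 32-bit loads. */
    void vec4_add_scalar(const float *a, const float *b, float *out)
    {
        out[0] = a[0] + b[0];
        out[1] = a[1] + b[1];
        out[2] = a[2] + b[2];
        out[3] = a[3] + b[3];
    }

    /* SIMD: one opcode fetch, two 128-bit loads, one 128-bit store. */
    void vec4_add_simd(const float *a, const float *b, float *out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

The program is four times shorter and the memory system sees wide,
aligned transactions instead of a run of small ones, which is exactly
the cost Hugh says the scalar approach has to win back elsewhere.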
> ------------------------------
>
> Message: 2
> Date: Tue, 22 Sep 2009 11:41:25 +1000
> From: Hugh Fisher <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: ogml <[email protected]>
>
> Timothy Normand Miller wrote:
> > Besides the obvious scalar ALU instructions, there are other
> > instructions that take bandwidth that are also not vector: flow
> > control, loads and stores. There's lots of those. No?
>
> No. A vertex shader has to multiply the vertex by the current matrix,
> which is four 4-way multiplies and four 4-way adds. No branching.
>
> Standard OpenGL/Direct3D fixed pipeline lighting has one loop, over
> the available light sources, say two scalar instructions. Each time
> through the loop there's the surface normal multiply by a 3x3 matrix
> and normalise, thirteen 3-way multiplies/adds and one scalar division.
> The lighting equation is eight (maybe more, depending on LIT) 4-way
> multiplies/adds and one scalar test & branch.
>
> Loads and stores are mostly of matrices (eg skinning), or materials
> and colors, which are one or more 3/4-way RGB/RGBA vectors.
>
> Loads from texture maps are also vector ops, either RGB/RGBA vectors
> or surface normals or other 3/4-way floating point vectors.
>
> > Also, if memory load instruction latency dominates, then none of this
> > matters. Many shader programs will spend most of their time waiting
> > on memory, making vector optimizations moot.
>
> If memory load is important, isn't SIMD faster than fetching and
> executing four scalar instructions in succession?
>
> > I can see vertex shader programs being DP heavy. But there will be
> > far fewer vertexes than fragments. How DP-heavy are fragment shader
> > programs, generally?
>
> Vertex processing is more important for CAD type workloads (lots of
> wireframes). For all types of 3D, as geometry is tessellated into
> smaller polys for more detail, the number of vertices increases relative
> to the number of fragments.
>
> In classic 3D, fragment shaders do texture loads and color multiplies,
> all 3/4-way vector ops. Modern fragment shaders implement full lighting
> calculations (see above), bump or displacement mapping (vector math),
> and fogging effects (vector math). Yes, they do test and branch as well,
> but like most aspects of 3D they are heavy on the vector/matrix maths.
>
> --
>         Hugh Fisher
>         CECS, ANU
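As a concrete instance of the vertex-transform count Hugh gives above,
here is a small C sketch (illustrative only; a column-major, OpenGL-style
matrix layout is assumed): v' = M*v computed as four 4-way multiplies
and four 4-way accumulating adds, with no branching anywhere.

    /* v' = v.x*col0 + v.y*col1 + v.z*col2 + v.w*col3 for a
       column-major 4x4 matrix m. The outer loop runs four times;
       each pass is one 4-way multiply and one 4-way add. */
    void transform_vertex(const float m[16], const float v[4], float out[4])
    {
        for (int i = 0; i < 4; ++i)
            out[i] = 0.0f;
        for (int col = 0; col < 4; ++col)          /* four passes...   */
            for (int i = 0; i < 4; ++i)            /* ...each 4 lanes  */
                out[i] += m[col * 4 + i] * v[col]; /* multiply + add   */
    }

On a 4-wide SIMD unit the inner loop is a single instruction, so the
whole transform is eight vector ops; a scalar unit issues all 32 lanes
one at a time.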
> ------------------------------
>
> Message: 3
> Date: Tue, 22 Sep 2009 21:03:36 -0400
> From: Andre Pouliot <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: Hugh Fisher <[email protected]>
> Cc: ogml <[email protected]>
>
> 2009/9/21 Hugh Fisher <[email protected]>
>
> > If memory load is important, isn't SIMD faster than fetching and
> > executing four scalar instructions in succession?
>
> With the currently proposed architecture the load would be the same,
> since one scalar instruction would be executed on at least 4 threads
> at once.
>
> The architecture would run many kernels (shader programs) at once, on
> multiple threads. So while fetching one instruction you're executing
> 4 threads at once. It may seem counterintuitive, but the shader engine
> will be executing a set of m kernels * n threads at once: the kernels
> (programs) are executed one after another, and the threads are the
> data set to process. Since most threads need to execute the same
> program, they are processed and controlled by the same kernel.
>
> Doing so reduces the need for data communication between the different
> ALUs and reduces the dead time. One vector operation on 1 thread will
> take more time to execute than it would on a standard SIMD unit, but
> when executing multiple threads controlled by the same kernel the
> overall time to execute a vector operation is reduced.
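Andre's arrangement, one instruction fetch driving several threads'
ALUs in lockstep, can be sketched as a toy interpreter in C (the
instruction format, register count, and names are all invented here
for illustration, not taken from the OGA2 design):

    #define NUM_THREADS 4    /* threads sharing one instruction stream */

    typedef enum { OP_ADD, OP_MUL, OP_HALT } Opcode;
    typedef struct { Opcode op; int dst, src; } Insn;

    /* One fetch/decode per instruction, NUM_THREADS executions: the
       front-end cost is amortised across the whole thread group. */
    void run_kernel(const Insn *program, float regs[NUM_THREADS][8])
    {
        for (const Insn *pc = program; pc->op != OP_HALT; ++pc)  /* 1 fetch    */
            for (int t = 0; t < NUM_THREADS; ++t)                /* N executes */
                switch (pc->op) {
                case OP_ADD: regs[t][pc->dst] += regs[t][pc->src]; break;
                case OP_MUL: regs[t][pc->dst] *= regs[t][pc->src]; break;
                default: break;
                }
    }

This is the sense in which the scalar design answers Hugh's fetch
objection: the opcode is fetched once per group, not once per lane.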
> ------------------------------
>
> Message: 4
> Date: Tue, 22 Sep 2009 21:23:48 -0400
> From: Andre Pouliot <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: Hugh Fisher <[email protected]>
> Cc: ogml <[email protected]>
>
> 2009/9/21 Hugh Fisher <[email protected]>
>
> > nVidia have, literally, a billion transistors to play with. They really,
> > really want to squeeze every last cycle of performance out of those
> > transistors, so they are willing to implement massively complicated logic
> > to split vector operations across multiple ALUs and fuse the results
> > back together afterwards, and to do runtime load balancing of shaders
> > across internal threads. They have far more resources for design and
> > simulation testing, on top of ten years' experience building these
> > chips.
>
> We are in the same boat as nVidia: we want to squeeze all the
> performance we can out of the transistors we have. Neither they nor we
> implement logic to split a vector operation into scalar ones; that is
> done at the compiler level. Each vector operation is confined to its
> own ALU; what's split between multiple ALUs is the section of the image
> to process. Runtime load balancing is done by complicated logic, and we
> will also need to do it. Probably not the same way, but software can't
> cope with the balancing act that needs to be done. Yes, they have more
> resources than we do to design and simulate their chip. That's why we
> must keep every part as simple as we can: a scalar unit is more easily
> created and debugged, and once it has been debugged we can scale it up
> easily.
>
> > Going scalar is not free. All the vector operations in the shaders have
> > to be emitted as sequences of scalar ops. That's easy, but it does mean
> > shader programs get longer and consume more memory. When the scalar ops
> > are executed on the GPU, you're doing multiple fetches of effectively
> > the same opcode, and multiple 32 bit reads from constant memory/vertex
> > data/shader registers instead of one 128 bit read. I suppose you could
> > add memory read/write buffering/look-ahead logic to recognise and merge
> > sequential memory accesses, but is that going to be any simpler than
> > just implementing SIMD?
>
> Yes, all the vector ops will be emitted as scalar ops. The programs do
> get longer, but scalar ops can be made shorter since we have fewer
> instructions to support. We do only one fetch to execute on multiple
> data, since most of the data is controlled by the same program: we
> parallelize the data set, but we consider each result independently
> from the others executed at the same time (different threads). The
> organisation of memory would be essentially the same for a SIMD design
> as for our current architecture. Both require a 256-bit memory access
> for an add operation and a 128-bit memory write. Control is the same
> too: the FPGA doesn't allow port access wider than 32 bits on a single
> memory block, so because of that requirement either the current
> architecture or a SIMD one would need 2 memory blocks per ALU. The
> connection is mostly wires; there is no read-ahead for the data.
>
> > AFAIK, historically every attempt to improve the performance of a CPU
> > for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> > Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> > same conclusion, I'd be inclined to follow suit.
>
> Those optimizations were meant to improve 3D rendering and scientific
> processing on a general purpose processor. A CPU doesn't have the same
> requirements and workload as a GPU; different problems and contexts
> call for different solutions.
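Andre's point about 32-bit FPGA memory ports applies equally to either
design; a small C model of the banked layout (bank count, sizes, and
names are invented here for illustration, not the actual OGA2 memory
map) makes that visible:

    #define BANKS 4            /* four 32-bit-wide block RAMs side by side */
    #define WORDS_PER_BANK 1024

    static float bank[BANKS][WORDS_PER_BANK];  /* model of FPGA block RAM */

    /* Presenting one address to all four 32-bit ports yields a 128-bit
       word per cycle, whether the consumer is a SIMD unit or four
       scalar ALUs in lockstep. A vec4 add thus needs two such reads
       (256 bits) and one such write (128 bits) on either design. */
    void read128(unsigned addr, float out[BANKS])
    {
        for (int b = 0; b < BANKS; ++b)
            out[b] = bank[b][addr];    /* four 32-bit reads in parallel */
    }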
> ------------------------------
>
> End of Open-graphics Digest, Vol 56, Issue 5
> ********************************************
>
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
