On Tue, Sep 22, 2009 at 6:23 PM, <[email protected]> wrote:

> Send Open-graphics mailing list submissions to
>        [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://lists.duskglow.com/mailman/listinfo/open-graphics
> or, via email, send a message with subject or body 'help' to
>        [email protected]
>
> You can reach the person managing the list at
>        [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Open-graphics digest..."
>
>
> Today's Topics:
>
>   1. Re: OGA2 SIMD/MIMD (Hugh Fisher)
>   2. Re: OGA2 SIMD/MIMD (Hugh Fisher)
>   3. Re: OGA2 SIMD/MIMD (Andre Pouliot)
>   4. Re: OGA2 SIMD/MIMD (Andre Pouliot)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 22 Sep 2009 10:59:55 +1000
> From: Hugh Fisher <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: ogml <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Kenneth Ostby wrote:
> > If you read through some of the original G80 architectural whitepapers
> > from Nvidia, you'll find they mention finding enough scalar
> > instructions in shaders to warrant a change from pure vector
> > processors to scalar ones. This is just a feature of the underlying
> > architecture, though: the programmer still operates with vector
> > instructions in the language of their choice, and the compiler
> > translates them into multiple scalar instructions.
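> >
> > (For concreteness, a minimal C sketch of that lowering; the vec4
> > type and the assembly-style comments are illustrative, not actual
> > compiler output:)
> >
> >     typedef struct { float x, y, z, w; } vec4;
> >
> >     /* what the programmer writes as a single vec4 add is emitted
> >        by the compiler as four independent scalar adds */
> >     vec4 vec4_add(vec4 a, vec4 b) {
> >         vec4 c;
> >         c.x = a.x + b.x;    /* ADD c0, a0, b0 */
> >         c.y = a.y + b.y;    /* ADD c1, a1, b1 */
> >         c.z = a.z + b.z;    /* ADD c2, a2, b2 */
> >         c.w = a.w + b.w;    /* ADD c3, a3, b3 */
> >         return c;
> >     }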
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really want to squeeze every last cycle of performance out of those
> transistors, so they are willing to implement massively complicated logic
> to split vector operations across multiple ALUs and fuse the results
> back together afterwards, and to do runtime load balancing of shaders
> across internal threads. They have far more resources for design and
> simulation testing, on top of ten years' experience building these
> chips.
>
> > Furthermore, a scalar processor will increase the throughput of the GPU,
> > which is our main concern. In a GPU, single-thread performance is really
> > not that important compared to the overall performance. Hence, with a
> > scalar design we can fully utilize the available cores even in the 25% of
> > the code where we have scalar instructions. The other option would be to
> > work some magic trying to fuse threads into vector ops when we
> > see multiple scalar instructions, but this seems overly complex.
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32 bit reads from constant memory/vertex
> data/shader registers instead of one 128 bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?
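>
> (A minimal C sketch of the difference in constant-memory traffic,
> assuming a 128-bit-wide memory port; the names are illustrative:)
>
>     typedef struct { float x, y, z, w; } vec4;       /* 128 bits */
>
>     /* SIMD: one opcode fetch, one 128-bit read */
>     vec4 load_simd(const vec4 *constant_mem, int i) {
>         return constant_mem[i];
>     }
>
>     /* scalar: four opcode fetches, four 32-bit reads of the
>        same logical constant */
>     void load_scalar(const float *constant_mem, int i, float out[4]) {
>         out[0] = constant_mem[4*i + 0];
>         out[1] = constant_mem[4*i + 1];
>         out[2] = constant_mem[4*i + 2];
>         out[3] = constant_mem[4*i + 3];
>     }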
>
> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suit.
>
> --
>        Hugh Fisher
>        CECS, ANU
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 22 Sep 2009 11:41:25 +1000
> From: Hugh Fisher <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: ogml <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Timothy Normand Miller wrote:
> >
> > Besides the obvious scalar ALU instructions, there are other
> > instructions that take bandwidth that are also not vector: flow
> > control, loads and stores.
> > There's lots of those, no?
>
> No. A vertex shader has to multiply the vertex by the current matrix,
> which is four 4-way multiplies and four 4-way adds. No branching.
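>
> (A minimal C sketch of that transform, with a column-major matrix;
> the names are illustrative. On 4-way vector hardware each column
> contributes one 4-way multiply plus one 4-way add; emitted as scalar
> code it becomes 16 multiplies and 12 adds:)
>
>     typedef struct { float x, y, z, w; } vec4;
>
>     /* m[j] is column j of the 4x4 modelview-projection matrix */
>     vec4 transform(const vec4 m[4], vec4 v) {
>         vec4 r;
>         r.x = m[0].x*v.x + m[1].x*v.y + m[2].x*v.z + m[3].x*v.w;
>         r.y = m[0].y*v.x + m[1].y*v.y + m[2].y*v.z + m[3].y*v.w;
>         r.z = m[0].z*v.x + m[1].z*v.y + m[2].z*v.z + m[3].z*v.w;
>         r.w = m[0].w*v.x + m[1].w*v.y + m[2].w*v.z + m[3].w*v.w;
>         return r;            /* straight-line code, no branching */
>     }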
>
> Standard OpenGL/Direct3D fixed pipeline lighting has one loop, over
> the available light sources, say two scalar instructions. Each time
> through the loop there's the surface normal multiply by (3x3) matrix
> and normalise, thirteen 3-way multiplies/adds and one scalar division.
> Lighting equation is eight (maybe more, depending on LIT) 4-way
> multiplies/adds and one scalar test & branch.
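>
> (A skeletal C version of that loop, just to show where the vector and
> the scalar work falls; the lighting equation itself is stubbed out and
> all names are illustrative:)
>
>     #include <math.h>
>
>     typedef struct { float x, y, z; } vec3;
>
>     /* 3x3 matrix times vector: nine multiplies, six adds */
>     static vec3 mat3_mul(const vec3 m[3], vec3 n) {
>         vec3 r;
>         r.x = m[0].x*n.x + m[1].x*n.y + m[2].x*n.z;
>         r.y = m[0].y*n.x + m[1].y*n.y + m[2].y*n.z;
>         r.z = m[0].z*n.x + m[1].z*n.y + m[2].z*n.z;
>         return r;
>     }
>
>     /* placeholder for the eight-or-so 4-way mul/adds of the real
>        lighting equation */
>     static vec3 lighting_eq(vec3 n, int light) {
>         (void)light;
>         return n;
>     }
>
>     vec3 shade(const vec3 nmat[3], vec3 normal, int num_lights) {
>         vec3 color = { 0.0f, 0.0f, 0.0f };
>         for (int i = 0; i < num_lights; i++) {  /* the scalar loop ops */
>             vec3 n = mat3_mul(nmat, normal);
>             /* normalise: the one scalar division */
>             float inv = 1.0f / sqrtf(n.x*n.x + n.y*n.y + n.z*n.z);
>             n.x *= inv; n.y *= inv; n.z *= inv;
>             vec3 c = lighting_eq(n, i);
>             color.x += c.x; color.y += c.y; color.z += c.z;
>         }
>         return color;
>     }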
>
> Loads and stores are mostly of matrices (eg skinning), or materials
> and colors which are one or more 3/4-way RGB/RGBA vectors.
>
> Loads from texture maps are also vector ops, either RGB/RGBA vectors
> or surface normals or other 3/4-way floating point vectors.
>
> > Also, if memory load instruction latency dominates, then none of this
> > matters.  Many shader programs will spend most of their time waiting
> > on memory, making vector optimizations moot.
>
> If memory load is important, isn't SIMD faster than fetching and
> executing four scalar instructions in succession?
>
> > I can see vertex shader programs being DP heavy.  But there will be
> > far fewer vertexes than fragments.  How DP-heavy are fragment shader
> > programs, generally?
>
> Vertex processing is more important for CAD type workloads (lots of
> wireframes). For all types of 3D, as geometry is tessellated into
> smaller polys for more detail, the number of vertices increases relative
> to fragments.
>
> In classic 3D, fragment shaders do texture loads and color multiplies,
> all 3/4-way vector ops. Modern fragment shaders implement full lighting
> calculations (see above), bump or displacement mapping (vector math),
> fogging effects (vector math). Yes they do test and branch as well,
> but like most aspects of 3D they are heavy on the vector/matrix maths.
>
> --
>        Hugh Fisher
>        CECS, ANU
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 22 Sep 2009 21:03:36 -0400
> From: Andre Pouliot <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: Hugh Fisher <[email protected]>
> Cc: ogml <[email protected]>
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> 2009/9/21 Hugh Fisher <[email protected]>
>
> > Timothy Normand Miller wrote:
> >
> >>
> >> Besides the obvious scalar ALU instructions, there are other
> >> instructions that take bandwidth that are also not vector: flow
> >> control, loads and stores.
> >> There's lots of those, no?
> >>
> >
> > No. A vertex shader has to multiply the vertex by the current matrix,
> > which is four 4-way multiplies and four 4-way adds. No branching.
> >
> > Standard OpenGL/Direct3D fixed pipeline lighting has one loop, over
> > the available light sources, say two scalar instructions. Each time
> > through the loop there's the surface normal multiply by (3x3) matrix
> > and normalise, thirteen 3-way multiplies/adds and one scalar division.
> > Lighting equation is eight (maybe more, depending on LIT) 4-way
> > multiplies/adds and one scalar test & branch.
> >
> > Loads and stores are mostly of matrices (eg skinning), or materials
> > and colors which are one or more 3/4-way RGB/RGBA vectors.
> >
> > Loads from texture maps are also vector ops, either RGB/RGBA vectors
> > or surface normals or other 3/4-way floating point vectors.
> >
> >> Also, if memory load instruction latency dominates, then none of this
> >> matters.  Many shader programs will spend most of their time waiting
> >> on memory, making vector optimizations moot.
> >>
> >
> > If memory load is important, isn't SIMD faster than fetching and
> > executing four scalar instructions in succession?
>
>
> With the currently proposed architecture the load would be the same,
> since one scalar instruction would be executed on at least 4 threads
> at once.
>
> The architecture would run many kernels (shader programs) at once, on
> multiple threads, so while fetching one instruction you're executing
> 4 threads at once. It may seem counter-intuitive, but the shader unit
> will be executing a set of m (kernels) * n (threads) at once: the
> kernels (programs) are executed one after another, and the threads are
> the data set to process. Since most threads need to execute the same
> program, they are processed and controlled by the same kernel.
>
> Doing so reduces the need for data communication between the different
> ALUs and reduces the dead time. One vector operation on a single
> thread will take more time to execute than on a standard SIMD unit,
> but when executing multiple threads controlled by the same kernel, the
> overall time to execute a vector operation is reduced.
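>
> (A rough C sketch of that execution model; the 4-thread group size,
> the opcode set, and all names are illustrative:)
>
>     #define NTHREADS 4    /* threads driven by one instruction fetch */
>
>     enum opcode { OP_ADD, OP_MUL, OP_HALT };
>     typedef struct { enum opcode op; int dst, a, b; } insn;
>     typedef struct { float r[8]; } regs;    /* per-thread registers */
>
>     void run_kernel(const insn *prog, regs t[NTHREADS]) {
>         for (int pc = 0; prog[pc].op != OP_HALT; pc++) {
>             insn i = prog[pc];                    /* one fetch, one decode */
>             for (int n = 0; n < NTHREADS; n++) {  /* every lane runs it */
>                 switch (i.op) {
>                 case OP_ADD: t[n].r[i.dst] = t[n].r[i.a] + t[n].r[i.b]; break;
>                 case OP_MUL: t[n].r[i.dst] = t[n].r[i.a] * t[n].r[i.b]; break;
>                 default: break;
>                 }
>             }
>         }
>     }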
>
> ------------------------------
>
> Message: 4
> Date: Tue, 22 Sep 2009 21:23:48 -0400
> From: Andre Pouliot <[email protected]>
> Subject: Re: [Open-graphics] OGA2 SIMD/MIMD
> To: Hugh Fisher <[email protected]>
> Cc: ogml <[email protected]>
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> 2009/9/21 Hugh Fisher <[email protected]>
>
> > Kenneth Ostby wrote:
> >
> >> If you read through some of the original G80 architectural whitepapers
> >> from Nvidia, you'll find they mention finding enough scalar
> >> instructions in shaders to warrant a change from pure vector
> >> processors to scalar ones. This is just a feature of the underlying
> >> architecture, though: the programmer still operates with vector
> >> instructions in the language of their choice, and the compiler
> >> translates them into multiple scalar instructions.
> >>
> >
> > Yes, I read those papers. It may be the best approach for nVidia; I'm
> > not convinced it's good for OGA2.
> >
> > nVidia have, literally, a billion transistors to play with. They really,
> > really want to squeeze every last cycle of performance out of those
> > transistors, so they are willing to implement massively complicated logic
> > to split vector operations across multiple ALUs and fuse the results
> > back together afterwards, and to do runtime load balancing of shaders
> > across internal threads. They have far more resources for design and
> > simulation testing, on top of ten years' experience building these
> > chips.
>
>
> We are in the same boat as nVidia: we want to squeeze all the
> performance we can get from the transistors we have. They don't
> implement logic to split vector operations into scalar ones, and
> neither do we; it's done at the compiler level. Each vector operation
> is confined to its own ALU; what's split between multiple ALUs is the
> section of the image to process. Their runtime load balancing is done
> by complicated logic, and we will also need to do it, probably not the
> same way, but software can't cope with the balancing act that needs to
> be done. Yes, they have more resources than we do to design and
> simulate their chips; that's why we must keep some parts as simple as
> we can. A scalar unit is more easily created and debugged, and once it
> has been debugged we can enlarge it easily.
>
>
> >
> >
> >> Furthermore, a scalar processor will increase the throughput of the GPU,
> >> which is our main concern. In a GPU, single-thread performance is really
> >> not that important compared to the overall performance. Hence, with a
> >> scalar design we can fully utilize the available cores even in the 25% of
> >> the code where we have scalar instructions. The other option would be to
> >> work some magic trying to fuse threads into vector ops when we
> >> see multiple scalar instructions, but this seems overly complex.
> >>
> >
> > Going scalar is not free. All the vector operations in the shaders have
> > to be emitted as sequences of scalar ops. That's easy, but it does mean
> > shader programs get longer and consume more memory. When the scalar ops
> > are executed on the GPU, you're doing multiple fetches of effectively
> > the same opcode, and multiple 32 bit reads from constant memory/vertex
> > data/shader registers instead of one 128 bit read. I suppose you could
> > add memory read/write buffering/look-ahead logic to recognise and merge
> > sequential memory accesses, but is that going to be any simpler than
> > just implementing SIMD?
> >
>
> Yes, all the vector ops will be emitted as scalar ops. The programs do
> get longer, but scalar ops can be made shorter since we have fewer
> instructions to support. We do only one fetch to execute on multiple
> data, since most of the data is controlled by the same program: we
> parallelize the data set, but we consider each result independently
> from the others executed at the same time (different threads). The
> organisation of memory would be essentially the same between a SIMD
> design and our current architecture: both require a 256-bit memory
> read for an add operation and a 128-bit memory write. Control is also
> the same; the FPGA doesn't allow port access wider than 32 bits on a
> single memory block. Because of those requirements, either the current
> architecture or a SIMD one would require 2 memory blocks per ALU. The
> connection is mostly wires; there is no read-ahead for the data.
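>
> (The arithmetic behind those numbers, spelled out; this assumes one
> 32-bit port per memory block, which is my reading of the constraint
> above:)
>
>     #include <stdio.h>
>
>     int main(void) {
>         const int lanes = 4, word = 32;     /* 4-wide op, 32-bit floats */
>         int read_bits  = 2 * lanes * word;  /* two source operands */
>         int write_bits = 1 * lanes * word;  /* one result */
>         printf("read %d bits, write %d bits\n", read_bits, write_bits);
>         /* prints: read 256 bits, write 128 bits.  With 32-bit ports,
>            each ALU needs 2 memory blocks to fetch both of its source
>            operands in the same cycle, SIMD or scalar alike. */
>         return 0;
>     }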
>
>
> >
> > AFAIK, historically every attempt to improve the performance of a CPU
> > for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> > Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> > same conclusion, I'd be inclined to follow suit.
>
>
> Those optimizations were made to improve 3D rendering and scientific
> processing on a general-purpose processor, which doesn't have the same
> requirements and workload as a GPU. Different problems and contexts
> require different solutions.
>
> ------------------------------
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
>
>
> End of Open-graphics Digest, Vol 56, Issue 5
> ********************************************
>
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
