2009/9/21 Hugh Fisher <[email protected]>

> Kenneth Ostby wrote:
>
>> If you read through some of the original G80 architectural whitepapers
>> from Nvidia, you can find that they mention that they have found enough
>> scalar instructions in shaders to warrant a change from pure vector
>> processors to scalar ones. Although, this is just an feature set of the
>> underlying architecture. The programmer will still operate on with his
>> vector instructions in his language of choice, and the compiler will
>> just translate them into multiple scalar instructions.
>>
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really, want to squeeze every last cycle of performance out of those
> transistors, so are willing to implement massively complicated logic
> to split vector operations across multiple ALUs and fuse the results
> back together afterwards; and do runtime load balancing of shaders
> across internal threads. They have far more resources for design and
> simulation testing, on top of ten years experience building these
> chips.


We are in the same boat as nVidia: we want to squeeze all the performance we
can get from the transistors we have. They don't implement hardware logic to
split vector operations into scalar ones, and neither do we; it's done at the
compiler level. Each vector operation is confined to its own ALU; what is
split between multiple ALUs is the section of the image to process. Their
runtime load balancing is done by complicated logic, and we will also need to
do it. Probably not the same way, but software alone can't cope with the
balancing act that needs to be done. Yes, they have more resources than we do
to design and simulate their chips. That's why we must keep some parts as
simple as we can. A scalar unit is more easily created and debugged, and once
it has been debugged we can scale it up easily.


>
>
>  Furthermore, a scalar processor will increase the throughput of the GPU,
>> which is our main concern. In a GPU, single thread performance is really
>> not that important compared to the overall performance. Hence, with a
>> scalar design we can fully utilize the amount cores even in the 25% of
>> the code where we have scalar instructions. The other option would be to
>> work some magic trying to fuse threads into vector ops. when we
>> see multiple scalar instructions, but this seems overly complex.
>>
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32 bit reads from constant memory/vertex
> data/shader registers instead of one 128 bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?
>

Yes, all the vector ops will be emitted as scalar ops. The program does get
longer, but the scalar ops themselves can be made shorter, since we have fewer
instructions to support. We do only one instruction fetch to execute on
multiple data, since most of the data is controlled by the same program: we
parallelize over the data set but consider each result independently from the
others executed at the same time (different threads). The organisation of
memory would be essentially the same between a SIMD design and our current
architecture. Both require a 256-bit memory read and a 128-bit memory write
for an add operation. Control is also the same: the FPGA doesn't allow port
access wider than 32 bits on a single memory block. Because of those
requirements, either the current architecture or a SIMD one would need 2
memory blocks per ALU. The connection is mostly wires; there is no read-ahead
logic for the data.


>
> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suite.


Those optimizations were meant to improve 3D rendering and scientific
processing on a general-purpose processor. A CPU doesn't have the same
requirements and workload as a GPU, and a different problem and context
require a different solution.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
