2009/9/21 Hugh Fisher <[email protected]>

> Kenneth Ostby wrote:
>
>> If you read through some of the original G80 architectural whitepapers
>> from Nvidia, you can find that they mention that they have found enough
>> scalar instructions in shaders to warrant a change from pure vector
>> processors to scalar ones. Although, this is just a feature of the
>> underlying architecture. The programmer will still operate with his
>> vector instructions in his language of choice, and the compiler will
>> just translate them into multiple scalar instructions.
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really, want to squeeze every last cycle of performance out of those
> transistors, so are willing to implement massively complicated logic
> to split vector operations across multiple ALUs and fuse the results
> back together afterwards; and do runtime load balancing of shaders
> across internal threads. They have far more resources for design and
> simulation testing, on top of ten years' experience building these
> chips.
We are in the same boat as nVidia: we want to squeeze all the performance
we can from the transistors we have. But neither they nor we implement
logic to split vector operations into scalar ones; that is done at the
compiler level. Each vector operation is confined to its own ALU; what is
split between multiple ALUs is the section of the image to process. Their
runtime load balancing is done by complicated logic, and we will need to
do it too. Probably not the same way, but software alone cannot cope with
the balancing act that needs to be done. Yes, they have more resources
than we do to design and simulate their chip. That's why we must keep
each part as simple as we can: a scalar unit is more easily created and
debugged, and once it has been debugged we can replicate it easily.

>> Furthermore, a scalar processor will increase the throughput of the GPU,
>> which is our main concern. In a GPU, single-thread performance is really
>> not that important compared to the overall performance. Hence, with a
>> scalar design we can fully utilize the number of cores even in the 25% of
>> the code where we have scalar instructions. The other option would be to
>> work some magic trying to fuse threads into vector ops when we
>> see multiple scalar instructions, but this seems overly complex.
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32-bit reads from constant memory/vertex
> data/shader registers instead of one 128-bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?

Yes, all the vector ops will be emitted as scalar ops.
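To make that concrete, here is a minimal sketch (in Python, purely
illustrative; the mnemonics and register names are hypothetical, not
OGA2's or anyone's actual ISA) of how a compiler backend might expand one
vec4 add into four scalar adds:

```python
# Illustrative sketch of vector-to-scalar expansion at compile time.
# The "ADD" mnemonic and r-register naming are made up for the example.

def expand_vec4_add(dst, a, b):
    """Turn one vector op 'dst = a + b' into four scalar ops,
    one per component (.x, .y, .z, .w)."""
    return [f"ADD {dst}.{c}, {a}.{c}, {b}.{c}" for c in "xyzw"]

for op in expand_vec4_add("r0", "r1", "r2"):
    print(op)
# ADD r0.x, r1.x, r2.x
# ADD r0.y, r1.y, r2.y
# ADD r0.z, r1.z, r2.z
# ADD r0.w, r1.w, r2.w
```

The per-component expansion is exactly why programs grow, but also why
each emitted instruction can stay simple.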
Programs do get longer, but the scalar ops themselves can be made
shorter, since we have fewer instructions to support. And we do only one
fetch to execute on multiple data: most of the data is controlled by the
same program, so we parallelize across the data set, but each result is
considered independently from the others executed at the same time
(different threads). The organisation of memory would be essentially the
same between a SIMD design and our current architecture: both require a
256-bit memory read for an add operation and a 128-bit memory write.
Control is also the same: the FPGA doesn't allow port access wider than
32 bits on a single memory block. Because of those requirements, either
the current architecture or a SIMD one would need two memory blocks per
ALU. The connection is mostly wires; there is no read-ahead for the data.

> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suit.

Those optimizations were meant to improve 3D rendering and scientific
processing on a general-purpose processor, which doesn't have the same
requirements and workload as a GPU. Different problems and contexts call
for different solutions.
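For the record, the memory-width figures above work out as follows. This
is a back-of-the-envelope sketch that assumes dual-ported 32-bit block
RAMs; that is a common FPGA primitive, but it is an assumption for
illustration here, not a statement about the actual part:

```python
# Back-of-the-envelope check of the memory-width claim, assuming
# (hypothetically) dual-ported 32-bit FPGA block RAMs.
import math

COMPONENT_BITS = 32
VEC_COMPONENTS = 4

# A vec4 add reads two operands and writes one result.
read_bits = 2 * VEC_COMPONENTS * COMPONENT_BITS    # two 128-bit operands
write_bits = 1 * VEC_COMPONENTS * COMPONENT_BITS   # one 128-bit result
print(read_bits, write_bits)  # 256 128

# Per scalar ALU and cycle: 2 reads + 1 write = 3 ports of 32 bits each.
ports_needed = 3
ports_per_block = 2  # dual-ported block RAM (assumption)
blocks_per_alu = math.ceil(ports_needed / ports_per_block)
print(blocks_per_alu)  # 2
```

Under those assumptions the arithmetic lands on the same numbers quoted
above: a 256-bit read, a 128-bit write, and two memory blocks per ALU.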
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
