Kenneth Ostby wrote:
> If you read through some of the original G80 architectural whitepapers
> from Nvidia, you'll find they mention having found enough scalar
> instructions in shaders to warrant a change from pure vector
> processors to scalar ones. Still, this is just a feature of the
> underlying architecture. The programmer will still operate with
> vector instructions in his language of choice, and the compiler will
> just translate them into multiple scalar instructions.
Yes, I read those papers. It may be the best approach for nVidia; I'm
not convinced it's good for OGA2.
nVidia have, literally, a billion transistors to play with. They really,
really, want to squeeze every last cycle of performance out of those
transistors, so are willing to implement massively complicated logic
to split vector operations across multiple ALUs and fuse the results
back together afterwards; and do runtime load balancing of shaders
across internal threads. They have far more resources for design and
simulation testing, on top of ten years' experience building these
chips.
> Furthermore, a scalar processor will increase the throughput of the
> GPU, which is our main concern. In a GPU, single-thread performance
> is really not that important compared to the overall performance.
> Hence, with a scalar design we can fully utilize all the cores even
> in the 25% of the code where we have scalar instructions. The other
> option would be to work some magic trying to fuse threads into vector
> ops when we see multiple scalar instructions, but this seems overly
> complex.
Going scalar is not free. All the vector operations in the shaders have
to be emitted as sequences of scalar ops. That's easy, but it does mean
shader programs get longer and consume more memory. When the scalar ops
are executed on the GPU, you're doing multiple fetches of effectively
the same opcode, and multiple 32 bit reads from constant memory/vertex
data/shader registers instead of one 128 bit read. I suppose you could
add memory read/write buffering/look-ahead logic to recognise and merge
sequential memory accesses, but is that going to be any simpler than
just implementing SIMD?
AFAIK, historically every attempt to improve the performance of a CPU
for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
same conclusion, I'd be inclined to follow suit.
--
Hugh Fisher
CECS, ANU
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)