I'm way behind on these discussions. Last week, I had a rebuttal to do for a paper I submitted to HPCA, so I was preoccupied with that. I'm sure a lot of what I have to say is obsolete. My apologies.
On Mon, Sep 21, 2009 at 8:59 PM, Hugh Fisher <[email protected]> wrote:

> Kenneth Ostby wrote:
>>
>> If you read through some of the original G80 architectural whitepapers
>> from Nvidia, you can find that they mention that they have found enough
>> scalar instructions in shaders to warrant a change from pure vector
>> processors to scalar ones. Although, this is just a feature set of the
>> underlying architecture. The programmer will still operate with his
>> vector instructions in his language of choice, and the compiler will
>> just translate them into multiple scalar instructions.
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really, want to squeeze every last cycle of performance out of those
> transistors, so they are willing to implement massively complicated
> logic to split vector operations across multiple ALUs and fuse the
> results back together afterwards, and to do runtime load balancing of
> shaders across internal threads. They have far more resources for
> design and simulation testing, on top of ten years' experience building
> these chips.

Since we don't want to waste resources, we propose generating scalar code
for scalar ALUs and scheduling at task (single fragment or vertex)
granularity. We'll trade off single-thread performance for higher aggregate
performance. Memory reads will be a major source of stalls; with plenty of
tasks scheduled to each core, when one task stalls, another is available
that wants to perform computations. Unless everything is stalled, in which
case we're saturating the memory bandwidth, which is okay.

I've proposed that we do sort-first rendering. If we group rendering into
32x32 blocks, that means we have 1024 tasks to schedule. It also means
that, if we're naive about it, we don't scale well beyond
1024 / (number of tasks per core) cores.
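To make the tiling arithmetic concrete, here's a back-of-the-envelope sketch. The 1024x1024 framebuffer is my assumption (32x32 blocks yielding 1024 tasks implies one); the function names are illustrative, not anything from the actual design:

```python
TILE = 32  # 32x32 screen blocks, one scheduling task each

def tile_count(width, height):
    """Number of 32x32 tiles covering the framebuffer (rounded up)."""
    return ((width + TILE - 1) // TILE) * ((height + TILE - 1) // TILE)

def max_useful_cores(tasks, tasks_per_core):
    """Naive scaling limit: beyond this many cores, some have no tasks."""
    return tasks // tasks_per_core

print(tile_count(1024, 1024))      # 1024 tasks for a 1024x1024 framebuffer
print(max_useful_cores(1024, 32))  # 32 cores if each core holds 32 tasks
```

So with 32 tasks per core, the naive scheme tops out at 32 cores for that framebuffer, which is the scaling ceiling mentioned above.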
Although, in that case, we'd try to overlap vertex and fragment processing.
Stuff to think about, but later, when it actually matters. In a Spartan-6
FPGA, we'll be lucky to fit a handful of engines.

My strategy for future-proofing is to be as open as possible. We will
_always_ carefully document the instruction set. If we come out with a new
architecture that needs a new LLVM backend and scheduler, people will have
prototypes in their hands with plenty of advance notice. What works well
with a dozen cores is VERY different from what works well with 1000 cores,
and my goal here is to make something that works well with a dozen cores.
To get good performance with our hardware, game developers will have to
stick to small, tightly optimized kernels.

>> Furthermore, a scalar processor will increase the throughput of the GPU,
>> which is our main concern. In a GPU, single-thread performance is really
>> not that important compared to the overall performance. Hence, with a
>> scalar design we can fully utilize all of the cores even in the 25% of
>> the code where we have scalar instructions. The other option would be to
>> work some magic, trying to fuse threads into vector ops when we see
>> multiple scalar instructions, but this seems overly complex.
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32-bit reads from constant memory/vertex
> data/shader registers instead of one 128-bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?

Think of our design as a sea of scalar engines. This MIMD stuff is a red
herring.
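For what "emitted as sequences of scalar ops" looks like in practice, here's a toy sketch (not the OGA2 compiler; register names like `r0.x` are hypothetical) of lowering one vec4 add into four scalar adds, the way a scalar back-end would:

```python
def lower_vec4_add(dst, a, b):
    """Expand dst = a + b (vec4 registers) into four scalar fadd ops.

    Each op is (opcode, dest, src1, src2). A real back-end would use
    numeric register indices; the .x/.y/.z/.w names are for illustration.
    """
    return [("fadd", f"{dst}.{c}", f"{a}.{c}", f"{b}.{c}") for c in "xyzw"]

print(lower_vec4_add("r0", "r1", "r2"))
# One vector op becomes four scalar ops: 4x the opcode fetches, and four
# 32-bit operand reads where a SIMD machine would do one 128-bit read.
```

That 4x expansion is exactly the code-size and fetch-bandwidth cost Hugh is pointing at; the bet below is that scheduling many scalar tasks hides it.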
It's easier to keep more of the hardware in a scalar core busy than in a
SIMD core. So rather than wasting space on vector ALUs, we'll just add more
scalar engines.

Since most kernels will follow the same instruction sequence, we'll track
them in groups. Say we assign 32 tasks to a core. We can schedule them in
groups of four, so we start out with 8 groups of four. Somewhere along the
line, maybe one task in a group will diverge (flow control). We split that
off and schedule it separately, so we then have 7 groups of 4, one group of
3, and one group of 1.

BTW, there's no reason why we couldn't include vector instructions. We could
have the instruction decoder emit a sequence of micro-ops when it encounters
one of these. In terms of programming, if we use an LLVM front-end, it'll
come down to a matter of code size.

> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suit.

And this is fine when your transistor budget is growing faster than you know
what to do with. We've had 2GHz processors for a long time. What's changed
between 250nm and 45nm is some architectural stuff, but even so, the
available area on a die has outpaced that, so designers just add more L2
cache. And more cores. These days, half the die area is L2 cache, even on a
dual-core.

Keep in mind that when it comes to CPUs, we're thinking of inherently
single-threaded workloads. We have aggressive out-of-order superscalar
machines. A four-issue engine is significantly more than 4x the area and
power draw of a single-issue in-order engine. Think of how many Atom cores
you can run in the same power envelope as a Core i7. If you had a
sufficiently parallel workload, you'd get more aggregate performance out of
the multiple Atoms than you would out of the one i7.
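The group-splitting scheme above can be modeled in a few lines. This is a toy simulation (the function name and task IDs are mine, not from the design), just to show the bookkeeping when one task in a lockstep group takes a divergent branch:

```python
def split_divergent(groups, task):
    """Remove `task` from whichever group holds it; give it its own group."""
    new_groups = []
    for g in groups:
        if task in g:
            rest = [t for t in g if t != task]
            if rest:
                new_groups.append(rest)  # the 3 tasks still in lockstep
            new_groups.append([task])    # the diverged task, alone
        else:
            new_groups.append(g)
    return new_groups

# Start: 32 tasks on a core, scheduled as 8 groups of 4.
groups = [list(range(i, i + 4)) for i in range(0, 32, 4)]
# Task 5 hits divergent flow control:
groups = split_divergent(groups, 5)
# Result: 7 groups of 4, one group of 3, one group of 1 (9 groups total).
```

A real scheduler would also want to re-merge groups when control flow reconverges, but the split alone captures the idea in the text.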
That's what we're doing here with the graphics workload. It's inherently
short-run and task-based (which is convenient) and inherently extremely
parallel. We have no NEED to optimize for single-threaded performance,
because we have no shortage of independent instructions to execute.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
