I'm way behind on these discussions. Last week, I had a rebuttal to do for a paper I submitted to HPCA, so I was preoccupied with that. I'm sure a lot of what I have to say is obsolete. My apologies.
On Mon, Sep 21, 2009 at 8:59 PM, Hugh Fisher <[email protected]> wrote:

> Kenneth Ostby wrote:
>>
>> If you read through some of the original G80 architectural whitepapers
>> from Nvidia, you can find that they mention that they have found enough
>> scalar instructions in shaders to warrant a change from pure vector
>> processors to scalar ones. Although, this is just a feature set of the
>> underlying architecture. The programmer will still operate with his
>> vector instructions in his language of choice, and the compiler will
>> just translate them into multiple scalar instructions.
>
> Yes, I read those papers. It may be the best approach for nVidia; I'm
> not convinced it's good for OGA2.
>
> nVidia have, literally, a billion transistors to play with. They really,
> really, want to squeeze every last cycle of performance out of those
> transistors, so they are willing to implement massively complicated
> logic to split vector operations across multiple ALUs and fuse the
> results back together afterwards, and to do runtime load balancing of
> shaders across internal threads. They have far more resources for
> design and simulation testing, on top of ten years' experience building
> these chips.

Since we don't want to waste resources, we propose generating scalar code
for scalar ALUs and scheduling at task (single fragment or vertex)
granularity. We'll trade off single-thread performance for higher aggregate
performance. Memory reads will be a major source of stalls; with plenty of
tasks scheduled to each core, when one task stalls, another is available
that wants to perform computations. Unless everything is stalled, in which
case we're saturating the memory bandwidth, which is okay.

I've proposed that we do sort-first rendering. If we group rendering into
32x32 blocks, that means we have 1024 tasks to schedule. It also means
that, if we're naive about it, we don't scale well beyond
1024 / (number of tasks per core) cores.
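To make the tiling arithmetic concrete, here's a back-of-the-envelope sketch. The 1024x1024 framebuffer is my assumption (32x32 blocks yielding 1024 tasks implies one); the function names are illustrative, not anything from the actual design:

```python
TILE = 32  # 32x32 screen blocks, one scheduling task each

def tile_count(width, height):
    """Number of 32x32 tiles covering the framebuffer (rounded up)."""
    return ((width + TILE - 1) // TILE) * ((height + TILE - 1) // TILE)

def max_useful_cores(tasks, tasks_per_core):
    """Naive scaling limit: beyond this many cores, some have no tasks."""
    return tasks // tasks_per_core

print(tile_count(1024, 1024))      # 1024 tasks for a 1024x1024 framebuffer
print(max_useful_cores(1024, 32))  # 32 cores if each core holds 32 tasks
```

So with 32 tasks per core, the naive scheme tops out at 32 cores for that framebuffer, which is the scaling ceiling mentioned above.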
Although, in that case, we'd try to overlap vertex and fragment processing.
Stuff to think about, but later, when it actually matters. In a Spartan-6
FPGA, we'll be lucky to fit a handful of engines.

My strategy for future-proofing is to be as open as possible. We will
_always_ carefully document the instruction set. If we come out with a new
architecture that needs a new LLVM backend and scheduler, people will have
prototypes in their hands with plenty of advance notice. What works well
with a dozen cores is VERY different from what works well with 1000 cores,
and my goal here is to make something that works well with a dozen cores.
To get good performance with our hardware, game developers will have to
stick to small, tightly optimized kernels.

>> Furthermore, a scalar processor will increase the throughput of the GPU,
>> which is our main concern. In a GPU, single-thread performance is really
>> not that important compared to the overall performance. Hence, with a
>> scalar design we can fully utilize all of the cores even in the 25% of
>> the code where we have scalar instructions. The other option would be to
>> work some magic, trying to fuse threads into vector ops when we see
>> multiple scalar instructions, but this seems overly complex.
>
> Going scalar is not free. All the vector operations in the shaders have
> to be emitted as sequences of scalar ops. That's easy, but it does mean
> shader programs get longer and consume more memory. When the scalar ops
> are executed on the GPU, you're doing multiple fetches of effectively
> the same opcode, and multiple 32-bit reads from constant memory/vertex
> data/shader registers instead of one 128-bit read. I suppose you could
> add memory read/write buffering/look-ahead logic to recognise and merge
> sequential memory accesses, but is that going to be any simpler than
> just implementing SIMD?

Think of our design as a sea of scalar engines. This MIMD stuff is a red
herring.
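For what "emitted as sequences of scalar ops" looks like in practice, here's a toy sketch (not the OGA2 compiler; register names like `r0.x` are hypothetical) of lowering one vec4 add into four scalar adds, the way a scalar back-end would:

```python
def lower_vec4_add(dst, a, b):
    """Expand dst = a + b (vec4 registers) into four scalar fadd ops.

    Each op is (opcode, dest, src1, src2). A real back-end would use
    numeric register indices; the .x/.y/.z/.w names are for illustration.
    """
    return [("fadd", f"{dst}.{c}", f"{a}.{c}", f"{b}.{c}") for c in "xyzw"]

print(lower_vec4_add("r0", "r1", "r2"))
# One vector op becomes four scalar ops: 4x the opcode fetches, and four
# 32-bit operand reads where a SIMD machine would do one 128-bit read.
```

That 4x expansion is exactly the code-size and fetch-bandwidth cost Hugh is pointing at; the bet below is that scheduling many scalar tasks hides it.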
It's easier to keep more of the hardware in a scalar core busy than in a
SIMD core. So rather than wasting space on vector ALUs, we'll just add more
scalar engines.

Since most kernels will follow the same instruction sequence, we'll track
them in groups. Say we assign 32 tasks to a core. We can schedule them in
groups of four, so we start out with 8 groups of four. Somewhere along the
line, maybe one task in a group will diverge (flow control). We split that
off and schedule it separately, so we then have 7 groups of 4, one group of
3, and one group of 1.

BTW, there's no reason why we couldn't include vector instructions. We could
have the instruction decoder emit a sequence of micro-ops when it encounters
one of these. In terms of programming, if we use an LLVM front-end, it'll
come down to a matter of code size.

> AFAIK, historically every attempt to improve the performance of a CPU
> for 3D rendering has started by adding SIMD instructions. (MIPS MDMX,
> Intel SSE, PowerPC AltiVec.) If a lot of smart people all come to the
> same conclusion, I'd be inclined to follow suit.

And this is fine when your transistor budget is growing faster than you know
what to do with. We've had 2GHz processors for a long time. What's changed
between 250nm and 45nm is some architectural stuff, but even so, the
available area on a die has outpaced that, so designers just add more L2
cache. And more cores. These days, half the die area is L2 cache, even on a
dual-core.

Keep in mind that when it comes to CPUs, we're thinking of inherently
single-threaded workloads. We have aggressive out-of-order superscalar
machines. A four-issue engine is significantly more than 4x the area and
power draw of a single-issue in-order engine. Think of how many Atom cores
you can run in the same power envelope as a Core i7. If you had a
sufficiently parallel workload, you'd get more aggregate performance out of
the multiple Atoms than you would out of the one i7.
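The group-splitting scheme above can be modeled in a few lines. This is a toy simulation (the function name and task IDs are mine, not from the design), just to show the bookkeeping when one task in a lockstep group takes a divergent branch:

```python
def split_divergent(groups, task):
    """Remove `task` from whichever group holds it; give it its own group."""
    new_groups = []
    for g in groups:
        if task in g:
            rest = [t for t in g if t != task]
            if rest:
                new_groups.append(rest)  # the 3 tasks still in lockstep
            new_groups.append([task])    # the diverged task, alone
        else:
            new_groups.append(g)
    return new_groups

# Start: 32 tasks on a core, scheduled as 8 groups of 4.
groups = [list(range(i, i + 4)) for i in range(0, 32, 4)]
# Task 5 hits divergent flow control:
groups = split_divergent(groups, 5)
# Result: 7 groups of 4, one group of 3, one group of 1 (9 groups total).
```

A real scheduler would also want to re-merge groups when control flow reconverges, but the split alone captures the idea in the text.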
That's what we're doing here with the graphics workload. It's inherently
short-run and task-based (which is convenient) and inherently extremely
parallel. We have no NEED to optimize for single-threaded performance,
because we have no shortage of independent instructions to execute.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
