On 2009-09-27, Timothy Normand Miller wrote:
> On Tue, Sep 22, 2009 at 9:03 PM, Andre Pouliot <[email protected]> 
> wrote:
> 
> >
> > With the proposed architecture the load would be the same, since one
> > scalar instruction would be executed on at least 4 threads at once.
> >
> > The architecture would run many kernels (shader programs) at once, on
> > multiple threads, so while fetching one instruction you are executing 4
> > threads at once. It may seem counter-intuitive, but the shader will be
> > executing a set of m (kernels) * n (threads) at once: the kernels
> > (programs) are executed one after another, and the threads are the
> > dataset to process. Since most threads need to execute the same program,
> > they are processed and controlled by the same kernel.
> 
> I think that we need to drop this mention of MIMD, moving the
> explanation to a later section.  Essentially what we're proposing here
> is a scalar architecture.  As an after-thought, we recognize that
> almost all fragments will use the same kernel and follow the same
> instruction sequence.  We can take advantage of this as a way to
> optimize the amount of SRAM space used for the instruction
> store/cache.  We group N tasks together to share the same instruction
> store.  As long as they're all fetching the same instruction, we only
> need a single port on the RAM; we send that same instruction down N
> independent execution pipelines.  This is nothing more than an AREA
> optimization, albeit quite a significant one.
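
To make sure I read this right, here is a rough behavioural sketch (Python,
certainly not RTL) of the grouping as I understand it: a single-ported
instruction store fetches one instruction per cycle and broadcasts it to N
execution lanes, each with its own registers.  The names and the toy ISA are
mine, purely for illustration.

# Rough behavioural sketch (not RTL) of the shared-fetch idea: one
# single-ported instruction store feeds N independent execution lanes.
N = 4  # lanes sharing one instruction store

class Lane:
    def __init__(self):
        self.regs = [0.0] * 16              # private register file per lane

    def execute(self, instr):
        op, dst, a, b = instr               # toy ISA, illustration only
        if op == "add":
            self.regs[dst] = self.regs[a] + self.regs[b]
        elif op == "mul":
            self.regs[dst] = self.regs[a] * self.regs[b]

lanes = [Lane() for _ in range(N)]
program = [("add", 0, 1, 2), ("mul", 3, 0, 0)]

for instr in program:                       # one fetch per cycle...
    for lane in lanes:                      # ...broadcast to all N lanes
        lane.execute(instr)

The only shared resource here is the fetch path, which is where the area
saving comes from; everything downstream is private per lane.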

The fragments will start off following the same instruction sequence,
but what happens after a few branches?  For minor forward branches we
might disable write-back, but how do we deal with the remaining
essential branches?  Naively, the groups would split exponentially,
leaving us with groups of single threads and 1/N utilisation.
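
To put a number on that worry, here is a toy model (assuming independent
50/50 outcomes at every data-dependent branch, which is pessimistic but
illustrative):

# Toy model of group splitting under divergent branches.
import random

N = 4            # threads per group (one shared instruction fetch)
BRANCHES = 8     # data-dependent two-way branches in the kernel

groups = [list(range(N))]                   # start with one full group
for _ in range(BRANCHES):
    split = []
    for g in groups:
        taken = [t for t in g if random.random() < 0.5]
        fall  = [t for t in g if t not in taken]
        split += [side for side in (taken, fall) if side]
    groups = split

# If the split groups have to take turns on the shared fetch port, the N
# lanes only do useful work for N out of every len(groups)*N lane-cycles.
print(len(groups), "groups; lane utilisation ~", N / (len(groups) * N))

After a handful of such branches the groups are almost always down to
single threads, which is the 1/N case above.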

We could think about a mechanism to recombine threads into full groups,
but can we do that with less overhead than what we save by grouping
threads in the first place?  How much do we save by this optimisation
compared to the area of the floating point multiplier and adder in the
ALU itself?  If we can't keep the groups complete, we're wasting that
area instead.
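
For what it is worth, one conceivable recombination mechanism (this is just
a sketch of a reconvergence stack, my assumption and not something from the
proposal) is to tag every divergent branch with the PC where its two paths
rejoin, park one half of the group, run the other half, and merge them again
once both halves have reached that PC:

# Sketch of a per-group reconvergence stack (assumed mechanism).
# program[pc] is ("br", taken_pc, rejoin_pc, cond) or any other tuple;
# execute(threads, instr) is a caller-supplied callback that applies an
# ordinary instruction to the given thread ids.
def run(program, group, execute):
    stack = []                 # parked halves: (resume_pc, threads, rejoin_pc)
    active, pc = list(group), 0
    while pc < len(program):
        # A parked half waiting at this rejoin point is either merged back
        # in (its path is done) or switched to so that it can catch up.
        while stack and stack[-1][2] == pc:
            other_pc, other, rejoin = stack.pop()
            if other_pc == rejoin:                  # other half already here
                active = active + other             # full group again
            else:                                   # let the other half run
                stack.append((pc, active, rejoin))
                active, pc = other, other_pc
        instr = program[pc]
        if instr[0] == "br":
            _, taken_pc, rejoin, cond = instr
            taken = [t for t in active if cond(t)]
            fall  = [t for t in active if not cond(t)]
            if taken and fall:                      # divergence: park one half
                stack.append((pc + 1, fall, rejoin))
                active, pc = taken, taken_pc
            else:                                   # all threads agree
                active, pc = (taken, taken_pc) if taken else (fall, pc + 1)
        else:
            execute(active, instr)
            pc += 1

Whether the bookkeeping for something like this costs less area than simply
accepting partial groups is exactly the question above.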

> > Doing so reduces the need for data communication between the different
> > ALUs and reduces the dead time. One vector operation on one thread will
> > take more time to execute than on a standard SIMD unit, but when executing
> > multiple threads controlled by the same kernel the overall time to execute
> > a vector operation will be reduced.
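
If I follow the throughput argument, the point seems to be roughly this
(made-up numbers, and I am assuming full 4-component vectors throughout):

# Illustrative throughput arithmetic for the scalar-vs-SIMD point above.
VEC = 4          # components per vector operation
N   = 4          # scalar pipelines working on N threads at once
T   = 100        # threads (fragments) to process

simd_cycles   = T * 1          # one 4-wide SIMD unit: 1 cycle per thread
scalar_cycles = T * VEC // N   # each thread needs VEC scalar cycles, but
                               # N threads are in flight at the same time
print(simd_cycles, scalar_cycles)   # 100 vs 100: same overall time

So each individual thread's vector operation takes longer, but the block as
a whole loses nothing, and I take the "dead time" above to mean the idle
SIMD lanes that the scalar pipes avoid when an operation is not a full
4-vector.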
> 
> Another optimization we make, much more fundamental than MIMD, is that
> we execute several tasks round-robin on each pipeline (at instruction
> granularity).  (If our MIMD is N wide, and we round-robin M tasks on
> each, that's N*M total tasks assigned to the block.)  This is like
> Niagara.  We eliminate the need for locks or NOOP instructions to deal
> with instruction dependencies.  In fact, the only instructions that
> would take long enough to actually stall on a dependency are memory
> reads and divides.
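
As a sanity check on the no-interlock claim, a little scheduling model
(cycle counts entirely made up):

# Toy model of instruction-granularity round-robin on ONE pipeline
# (Niagara-style barrel scheduling); all latencies are made-up numbers.
M = 4                              # tasks interleaved on the pipeline
LATENCY = {"alu": 3, "load": 20}   # result latencies, in cycles

tasks = [{"ready_at": 0, "issued": 0} for _ in range(M)]
idle = 0

for cycle in range(400):
    task = tasks[cycle % M]                  # strict round-robin issue slot
    if cycle < task["ready_at"]:
        idle += 1                            # only long ops ever get here
        continue
    op = "load" if task["issued"] % 10 == 9 else "alu"   # made-up mix
    task["ready_at"] = cycle + LATENCY[op]
    task["issued"] += 1

# With M = 4 and an ALU latency of 3, a task's previous result is always
# ready by its next issue slot, so no interlocks or NOPs are needed; the
# idle slots counted here all come from the long loads.
print("idle issue slots out of 400:", idle)

That matches my reading: ordinary dependencies disappear for free, and only
the memory reads and divides are left to worry about.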

If I understand this correctly, it means that all pipelines will hit a
certain memory instruction simultaneously, after which N parallel units
will try to load/store for M consecutive cycles.

A possible fix would be to let each pipeline run M cycles behind its
neighbour.  The first one reads instruction words from the store and
streams them on to the next pipeline.  Still, there is the problem that
when the kernel hits a load/store there will be N*M consecutive cycles
during which all N*M threads stall because of a single one of those threads.
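
To see what the stagger buys (and what it does not), a back-of-envelope
model with made-up sizes, assuming a stagger of M cycles between
neighbouring pipelines:

# Back-of-envelope model of staggered pipelines hitting a load.
N, M = 4, 4                # pipelines, and tasks round-robined on each
OFFSET = M                 # assumed stagger between neighbouring pipelines
LOAD_INSTR = 10            # kernel instruction index that is a load/store

# Cycle at which each (pipeline, task) pair issues its memory request.
requests = sorted(p * OFFSET + LOAD_INSTR * M + t
                  for p in range(N) for t in range(M))
print(requests)            # 16 requests: one per cycle over N*M consecutive cycles

# The N*M requests now reach the memory unit one per cycle instead of N at
# once -- but that is also the N*M-cycle window during which the whole
# block is held up by those single loads.

So the stagger smooths out the traffic to the memory unit, but it does not
shorten the window in which everything is waiting.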

The proposed architecture is nice given a mostly linear flow of
instructions that only use local memory, but can it deal with the more
general case effectively?  If threads were much more lightweight, it
would seem easier to come up with a solution.