On Tue, Sep 22, 2009 at 9:03 PM, Andre Pouliot <[email protected]> wrote:

>
> With the proposed architecture the load would be the same, since one
> scalar instruction would be executed on at least 4 threads at once.
>
> The architecture would run many kernels (shader programs) at once, on
> multiple threads. So while fetching one instruction, you're executing
> 4 threads at once. It may seem counter-intuitive, but the shader will
> be executing a set of m (kernels) * n (threads) at once, with the
> kernels (programs) executed one after another and the threads being
> the dataset to process. Since most threads need to execute the same
> program, they are processed and controlled by the same kernel.

I think that we need to drop this mention of MIMD and move the
explanation to a later section.  Essentially, what we're proposing here
is a scalar architecture.  As an afterthought, we recognize that
almost all fragments will use the same kernel and follow the same
instruction sequence.  We can take advantage of this to optimize the
amount of SRAM used for the instruction store/cache: we group N tasks
together to share the same instruction store.  As long as they're all
fetching the same instruction, we only need a single port on the RAM;
we send that same instruction down N independent execution pipelines.
This is nothing more than an AREA optimization, albeit quite a
significant one.
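To make the sharing concrete, here's a toy software model (the names
and structure are my own illustration, not a hardware spec): one fetch
per cycle from a single-ported instruction store is broadcast to N
independent scalar pipelines, each running the same kernel on its own
data.

```python
# Toy model (illustrative, not a hardware spec): N scalar pipelines
# share one single-ported instruction store.  Each cycle, ONE fetch is
# broadcast to all N pipelines, which execute it on their own data.

KERNEL = [("addi", 1), ("addi", 2), ("muli", 3)]  # shared instruction store

def run_block(inputs, kernel=KERNEL):
    """Execute the same kernel on N independent data items in lockstep."""
    regs = list(inputs)                  # one accumulator per pipeline
    for op, imm in kernel:               # one fetch per cycle...
        for i in range(len(regs)):       # ...sent down all N pipelines
            if op == "addi":
                regs[i] += imm
            elif op == "muli":
                regs[i] *= imm
    return regs

print(run_block([0, 10, 100, 1000]))  # → [9, 39, 309, 3009]
```

The point of the sketch is that the instruction store is read once per
cycle no matter how many pipelines consume the fetch, which is exactly
why the sharing is a pure area win.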

> Doing so reduces the need for data communication between the
> different ALUs and reduces the dead time.  One vector operation on
> one thread will take more time to execute than on a standard SIMD
> unit, but when executing multiple threads controlled by the same
> kernel, the overall time to execute a vector operation will be
> reduced.

Another optimization we make, much more fundamental than MIMD, is that
we execute several tasks round-robin on each pipeline, at instruction
granularity.  (If our MIMD is N wide, and we round-robin M tasks on
each pipeline, that's N*M total tasks assigned to the block.)  This is
like Niagara: by the time a task issues its next instruction, its
previous result is already available, so we eliminate the need for
interlocks or NOP instructions to deal with instruction dependencies.
In fact, the only instructions that would take long enough to actually
stall on a dependency are memory reads and divides.
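A small sketch of why the round-robin hides latency (again my own toy
model, with made-up function names): consecutive instructions from the
same task issue M cycles apart, so any result with latency L <= M is
ready before its consumer issues.

```python
# Toy model (illustrative): one pipeline round-robins M tasks at
# instruction granularity.  Cycle c issues instruction (c // M) of
# task (c % M), so same-task instructions are M cycles apart.

def issue_schedule(m_tasks, instrs_per_task):
    """Return the (task, instruction) pair issued on each cycle."""
    return [(c % m_tasks, c // m_tasks)
            for c in range(m_tasks * instrs_per_task)]

def hides_latency(m_tasks, result_latency):
    """A dependent instruction issues m_tasks cycles after its
    producer; it stalls only if the result isn't ready by then."""
    return result_latency <= m_tasks

print(issue_schedule(2, 2))   # → [(0, 0), (1, 0), (0, 1), (1, 1)]
print(hides_latency(4, 3))    # 3-cycle ALU result, 4 tasks → True
print(hides_latency(4, 100))  # 100-cycle memory read → False
```

This matches the claim above: short ALU latencies are covered for free
by the interleaving, and only long operations like memory reads and
divides can still stall.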

We also get some of the benefits of an out-of-order (OOO) engine
(although not a superscalar one) without having to add any special
hardware to support it.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
