On Sat, Oct 10, 2009 at 6:10 AM, Petter Urkedal <[email protected]> wrote:
> On 2009-09-27, Timothy Normand Miller wrote:
>> On Tue, Sep 22, 2009 at 9:03 PM, Andre Pouliot <[email protected]> 
>> wrote:
>>
>> >
>> > With the currently proposed architecture the load would be the same,
>> > since one scalar instruction would be executed on at least 4 threads
>> > at once.
>> >
>> > The architecture would run many kernels (shader programs) at once, on
>> > multiple threads.  So while fetching one instruction you're executing
>> > 4 threads at once.  It may seem counter-intuitive, but the shader will
>> > be executing a set of m (kernels) * n (threads) at once, the kernels
>> > (programs) being executed one after another and the threads being the
>> > dataset to process.  Since most threads need to execute the same
>> > program, they are processed and controlled by the same kernel.
>>
>> I think that we need to drop this mention of MIMD, moving the
>> explanation to a later section.  Essentially what we're proposing here
>> is a scalar architecture.  As an after-thought, we recognize that
>> almost all fragments will use the same kernel and follow the same
>> instruction sequence.  We can take advantage of this as a way to
>> optimize the amount of SRAM space used for the instruction
>> store/cache.  We group N tasks together to share the same instruction
>> store.  As long as they're all fetching the same instruction, we only
>> need a single port on the RAM; we send that same instruction down N
>> independent execution pipelines.  This is nothing more than an AREA
>> optimization, albeit quite a significant one.
>
> The fragments will start off following the same instruction sequence,
> but what happens after a few branches?  For minor forward branches we
> might disable write-back, but how do we deal with the remaining
> essential branches?  Naively, the groups would split exponentially,
> leaving us with groups of single threads and a 1/N utilisation.

I have two answers to that.  (1) except for unusual edge (literally)
cases, the instruction flow will almost always be identical, so the
splitting won't happen very often.  (2) divergence (however we handle
it) will increase random access for reading textures and other
surfaces, decreasing the effective memory bandwidth; as long as memory
is the bottleneck, we don't care about inefficiency in the shaders.

Of course, this qualitative analysis is no substitute for hard
experimental data, but the first rev of our design is going to involve
a lot of guesswork.  There's no way we're going to be able to predict
in advance what is the optimal design, so we just have to take our
best guess and move forward.
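To make the worst case Petter describes concrete, here is a toy model (my own framing, with hypothetical numbers) of how utilization degrades if lock-step groups naively split on every divergent branch and never recombine:

```python
# Toy model (mine, not from the thread) of naive divergence in a
# lock-step group of N lanes: each fully divergent branch splits every
# subgroup in two, subgroups never recombine, and each instruction
# fetch drives the whole N-wide datapath for just one subgroup.

def lockstep_utilization(n_lanes, n_divergent_branches):
    """Average fraction of lanes doing useful work per fetch, in the
    worst case of k fully divergent branches."""
    n_subgroups = min(2 ** n_divergent_branches, n_lanes)
    return 1.0 / n_subgroups

# No divergence: full utilization.  After enough branches the group is
# reduced to single threads and we bottom out at 1/N.
print(lockstep_utilization(4, 0))   # 1.0
print(lockstep_utilization(4, 1))   # 0.5
print(lockstep_utilization(4, 10))  # 0.25  (the 1/N floor for N = 4)
```

If answer (1) holds and divergent branches are rare, the common case stays near the top of this curve.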

> We could think about a mechanism to recombine threads into full groups,
> but can we do that with less overhead than what we save by grouping
> threads in the first place?  How much do we save by this optimisation
> compared to the area of the floating point multiplier and adder in the
> ALU itself?  If we can't keep the groups complete, we're wasting that
> area instead.

IIRC, I had said that a shader would require 4.25 BRAMs.  If we don't
do this optimization, each shader would run 8 threads round-robin and
require 5 BRAMs.  Given, say, 300 BRAMs, we have the choice between 60
and 66 shaders.  So you may be right about this.  We squeeze in 6
extra shaders, but with the potential to make some of them much less
efficient.
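For concreteness, the arithmetic above can be sketched as follows. One assumption here is mine: that the 4.25-BRAM figure comes from four shaders sharing a single icache BRAM (4 private BRAMs plus a quarter of the shared one each). Under that guess the packing comes out slightly above the 66 quoted above, so treat the breakdown as illustrative only.

```python
# Back-of-envelope shader counts for a given BRAM budget.  The 5-BRAM
# and 4.25-BRAM per-shader figures are from the discussion above; the
# 4-shaders-per-shared-icache grouping is my guess at the breakdown.

def shaders_private_icache(bram_budget, brams_per_shader=5):
    """Each shader carries its own icache: budget // 5 shaders."""
    return bram_budget // brams_per_shader

def shaders_shared_icache(bram_budget, group_size=4, brams_per_group=17):
    """Groups of 4 shaders share one icache BRAM: 17 BRAMs per group,
    i.e. 4.25 BRAMs per shader on average."""
    return (bram_budget // brams_per_group) * group_size

print(shaders_private_icache(300))  # 60
print(shaders_shared_icache(300))   # 68
```

Either way the gain is under a dozen shaders out of ~60, which is the point: a modest area win bought at the risk of much worse per-shader efficiency.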

In light of your comment, I say we ditch the icache space optimization
for now.  Or at most, we might consider feeding TWO pipelines from the
same icache, since the BRAMs are dual-ported, but we can even make
that an afterthought.
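A sketch of that two-pipelines-per-icache afterthought (a toy model; real BRAM timing and pipelining are more involved): a dual-ported instruction store can serve two independent program counters in the same cycle, so the two pipelines need not stay in lock-step at all.

```python
# Minimal model of a dual-ported instruction store feeding two
# pipelines.  Both read ports are serviced every cycle, so the
# pipelines can diverge to different PCs without any arbitration.

class DualPortedIStore:
    def __init__(self, instructions):
        self.mem = list(instructions)

    def fetch(self, pc_a, pc_b):
        """One cycle: port A and port B each return an instruction."""
        return self.mem[pc_a], self.mem[pc_b]

program = ["mul", "add", "br", "ld", "st"]  # hypothetical kernel
istore = DualPortedIStore(program)
# Two pipelines at different PCs, served in the same cycle:
print(istore.fetch(1, 3))  # ('add', 'ld')
```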

>> > Doing so reduces the need for data communication between the
>> > different ALUs and reduces the dead time present.  One vector
>> > operation on 1 thread will take more time to execute than on a
>> > standard SIMD unit, but when executing multiple threads controlled
>> > by the same kernel, the overall time to execute a vector operation
>> > will be reduced.
>>
>> Another optimization we make, much more fundamental than MIMD, is that
>> we execute several tasks round-robin on each pipeline (at instruction
>> granularity).  (If our MIMD is N wide, and we round-robin M tasks on
>> each, that's N*M total tasks assigned to the block.)  This is like
>> Niagara.  We eliminate the need for locks or NOOP instructions to deal
>> with instruction dependencies.  In fact, the only instructions that
>> would take long enough to actually stall on a dependency are memory
>> reads and divides.
>
> If I understand this correctly, it means that all pipelines will hit a
> certain memory instruction simultaneously, after which M parallel units
> will try to load/store for N consecutive cycles.
>
> A possible fix would be to let each pipeline run N cycles behind their
> neighbour.  The first one reads instruction words from the store and
> streams them on to the next pipeline.  Still, there is the problem that
> when the kernel hits a load/store there will be N*M consecutive cycles
> where all N*M threads stall due to a single one of those threads.

And moreover, we need to think about how many independent paths there
will be to the global dcache.  Lots of shaders trying to hit memory at
once will bog down; the accesses really get serialized.  The main
reason we have so many shaders, actually, is that the proportion of
math and flow-control instructions in a kernel should be high compared
to the number of memory accesses.
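The bottleneck argument can be put in rough numbers (my own toy model, hypothetical figures): if S shaders each issue one memory access every R instructions, a shared dcache with P ports sees S/R requests per cycle and saturates once S/R exceeds P, so the kernel's compute-to-memory ratio R has to grow with the shader count.

```python
import math

# Toy model of contention at the global dcache.  All figures are
# hypothetical; the point is only how the required compute-to-memory
# ratio scales with the number of shaders.

def dcache_requests_per_cycle(n_shaders, instrs_per_mem_access):
    """Average memory requests per cycle from all shaders combined."""
    return n_shaders / instrs_per_mem_access

def min_compute_ratio(n_shaders, n_ports=1):
    """Smallest instructions-per-memory-access ratio R that keeps
    n_shaders / R <= n_ports (no steady-state oversubscription)."""
    return math.ceil(n_shaders / n_ports)

# 60 shaders, one memory access every 30 instructions, 1 dcache port:
print(dcache_requests_per_cycle(60, 30))  # 2.0 -> 2x oversubscribed
print(min_compute_ratio(60))              # 60 instructions per access needed
```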

> The proposed architecture is nice given a mostly linear flow of
> instructions which only use local memory, but can it deal with the more
> general case effectively?  If threads were much more lightweight, it
> would seem easier to come up with a solution.

What did you have in mind?


BTW, there's an important thing we have to keep in mind:  The
rasterizer is going to churn out at most one fragment per cycle.  That
means that tasks issued to shaders will be inherently staggered.
Probably another reason to ditch the shared icaches and synchronized
tasks.
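A toy illustration of that staggering (straight-line code, one instruction per cycle, both assumptions mine): if task i is issued at cycle i, then at any later cycle the live tasks' program counters are all distinct, so a lock-step shared icache would never see them aligned.

```python
# Toy model: the rasterizer emits one fragment per cycle, task i starts
# at cycle i, and each task then advances one instruction per cycle
# (straight-line kernel, no stalls -- illustrative assumptions only).

def pcs_at_cycle(n_tasks, cycle):
    """Program counters of the tasks that have started by `cycle`."""
    return [cycle - i for i in range(n_tasks) if cycle >= i]

pcs = pcs_at_cycle(4, 3)
print(pcs)                        # [3, 2, 1, 0]
print(len(set(pcs)) == len(pcs))  # True: all PCs distinct, never aligned
```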


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
