On Mon, Oct 12, 2009 at 2:47 PM, Petter Urkedal <[email protected]> wrote:
>> > We should also not completely fix the number of threads per pipeline
>> > before we design it. If we can save one stage, so we have 7 instead
>> > of 8 stages, do we have an argument for rounding up to 8? If not, we
>> > can go with 7 threads and thus fit more pipelines.
>>
>> If we assign N threads to 1 pipeline, we need one icache, one of each
>> functional unit, but N register files. If we allocate 256 registers
>> per thread, then we'll want to assign threads in even numbers, since
>> the BRAMs hold 512 words.
>
> If we end up with an odd number of stages, we might manage to share a
> BRAM between two pipelines if we can use both ports, or if we can
> arrange so that the access is never on the same cycle.

Each thread gets half a RAM for its registers. Say threads A and B are
assigned to BRAM 0. We just have to make sure that whenever A or B has
an instruction in REG, the write-back happening on the same cycle
belongs to some other thread, like G, which is assigned to BRAM 3. It
shouldn't be too hard to size the pipeline so that we never have a read
and a write to the same BRAM on the same cycle. (I sketch a quick check
of this below.)

>> >> > It may be tempting to add one or
>> >> > two extra threads per ALU to keep the ALU busy, but due to the
>> >> > cost and the low frequency of loads, it may be better to send a
>> >> > "phantom" down the ALU for the thread doing the load. The result
>> >> > of the load can be fetched back via a short return queue on each
>> >> > ALU. This could be just one or two slots if we allow stalling in
>> >> > rare cases. As soon as a "phantom" comes out of the ALU, a real
>> >> > thread is dequeued and passed down in place of it.
>> >>
>> >> I'm not sure, but you may be saying the same thing. :)
>> >
>> > No. The load instruction does not prevent the execution of the next
>> > thread. Say thread I is a load and for simplicity we only have 4
>> > stages:
>> >
>> > (  I0,   H0,   G0,   F0)
>> > (  F1, noop,   H0,   G0)
>> > (  G1,   F1, noop,   H0)
>> > (  H1,   G1,   F1, noop)
>> > (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
>> > (  F2, noop,   H1,   G1)  <-- result for thread I ready
>> > (  G2,   F2, noop,   H1)
>> > (  H2,   G2,   F2, noop)
>> > (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled
>> >
>> > It should be noted that if there are several loads in action for a
>> > pipeline, we can plug the first noop which is about to re-enter the
>> > pipeline with the first thread from the pipeline which finishes the
>> > load.
>>
>> Those noops could go on for a long time. The memory latency is
>> unpredictable.
>
> Yes, but unless we are willing to spend an extra thread context, there
> is no way we can utilise the pipeline better than (N - 1)/N while
> there is a pending load in progress.

This is absolutely true. A thread will send nothing (implicit NOOPs)
down the pipeline the whole time it's waiting on read data. That's
unavoidable unless we allow more than N tasks to be assigned to a given
engine. I propose that for now, we just lump it. Later, we can measure
how common it actually is: How much time do tasks spend waiting? How
many additional threads do we have to assign to bring the average wait
to zero? What impact does that have on total throughput across
benchmarks?

There's a tradeoff here. If we assign one more BRAM to a shader, we can
add two more tasks to the rotation. That reduces the total number of
shaders we can fit on a chip, but with little or no impact on the gate
count of the pipeline. (To pin down what I mean, two throwaway Python
sketches follow.)
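First, the BRAM scheduling claim. This is a toy check, not RTL: I'm
assuming 8 threads per pipeline, two threads sharing each BRAM,
round-robin issue, and a fixed distance in cycles between the REG
(read) stage and write-back. The point is just that some pipeline
depths are conflict-free, and we get to pick one of those.

    # Toy check: with N threads issued round-robin and two threads
    # sharing each BRAM, find read-to-write-back distances for which
    # the REG-stage read and the write-back never hit the same BRAM on
    # the same cycle.  N = 8 and the scanned distances are stand-in
    # numbers, not our actual ones.
    N = 8

    def conflict_free(wb_offset, n=N):
        """True if no cycle reads and writes the same BRAM at once."""
        for cycle in range(n):                 # schedule repeats every n cycles
            reading = cycle % n                # thread in the REG (read) stage
            writing = (cycle - wb_offset) % n  # thread in the write-back stage
            if reading // 2 == writing // 2:   # same BRAM (threads 2k, 2k+1 share)
                return False
        return True

    for d in range(1, N):
        status = "ok" if conflict_free(d) else "conflict"
        print(f"write-back {d} cycles after REG: {status}")

With these made-up numbers, distances of 2 through 6 cycles come out
conflict-free; only 1 and 7 (a thread writing back while its BRAM
partner reads) collide. That backs up the "just size the pipeline
right" argument.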
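Second, the rotation itself, to make sure we're describing the same
scheme. Again plain Python, not hardware; the 4 stages, the per-thread
programs, and the fixed 5-cycle memory latency are all made up. A
parked thread's slot carries None, which serves equally well as the
implicit NOOP or as a cleared "valid" flag (more on that choice at the
end of this message):

    # Toy model of the barrel rotation quoted above: N thread contexts
    # share an N-stage pipeline, one issue per cycle.  A thread whose
    # next instruction is a load is parked in the memory unit while a
    # phantom (None) circulates in its slot; when the load-return queue
    # has data, the returned thread plugs the first phantom about to
    # re-enter.  Stage count, programs, and latency are illustrative.
    from collections import deque

    N = 4                        # pipeline stages == thread contexts (assumed)
    MEM_LATENCY = 5              # fixed here for illustration; real latency varies

    programs = {                 # per-thread instruction streams (made up)
        "F": ["alu"] * 8,
        "G": ["alu"] * 8,
        "H": ["alu"] * 8,
        "I": ["alu", "load", "alu", "alu", "alu"],
    }
    pc = {t: 0 for t in programs}            # index of each thread's instruction in flight
    pipeline = deque(["I", "H", "G", "F"])   # stage 0 .. stage N-1; F retires first
    pending = deque()                        # (cycle_data_ready, thread) loads in flight
    return_queue = deque()                   # threads whose load data has come back

    for cycle in range(20):
        # Loads whose (assumed) latency has elapsed land in the return queue.
        while pending and pending[0][0] <= cycle:
            return_queue.append(pending.popleft()[1])

        done = pipeline.pop()                # whatever leaves the last stage
        if done is None:
            # A phantom is about to re-enter: plug it with a returned thread.
            nxt = return_queue.popleft() if return_queue else None
        else:
            pc[done] += 1                    # that thread's instruction completed
            if pc[done] >= len(programs[done]):
                nxt = None                   # program finished; slot goes idle
            elif programs[done][pc[done]] == "load":
                pending.append((cycle + MEM_LATENCY, done))
                pc[done] += 1                # the load is dispatched to memory...
                nxt = None                   # ...and a phantom rides the ALU slot
            else:
                nxt = done                   # thread re-issues its next instruction
        pipeline.appendleft(nxt)

        busy = sum(s is not None for s in pipeline)
        print(f"cycle {cycle:2d}: {[s or '-' for s in pipeline]}  busy {busy}/{N}")

Running it reproduces the shape of the table Petter drew: thread I's
slot circulates empty for two full rotations (3 of 4 stages stay busy,
i.e. (N - 1)/N), and then I plugs the first phantom about to re-enter
once its data is back.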
Back to the tradeoff: if we're constrained by logic, it'll be obvious
that we should add more contexts to the shaders we can fit. If we're
constrained by BRAMs, then we should reduce the number of contexts so
that we can fit more shaders. Once we have a working design, we may be
able to find people willing to help fine-tune it through this kind of
design-space exploration.

>> Besides, this is moot if we dispense with the MIMD.
>> I'm assuming your columns are different threads. Getting rid of MIMD
>> means that we just round-robin N threads on one pipeline. We're not
>> gluing together more than one pipeline.
>
> The above scheme is a way to round-robin N threads on an N-wide
> pipeline with the ability to remove and add back threads. I'm just
> spelling it out since I'm not sufficiently familiar with hardware
> design to know how all the details are worked out. We have probably
> had the same design in mind from the start, and I was just confused
> about "NOOPs get issued [ins: on every N cycles] until the read return
> queue is no longer empty".

Splitting sets is easy: you just detect branch instructions taking
different directions. Recombining them, though, requires a lot more
logic. Besides, I'm becoming more and more convinced that we should
drop MIMD and go with a 1-wide pipeline. (Although I'm open to two
pipelines sharing the same icache, since BRAMs are dual-ported.)

There are two ways to handle the stall while waiting on memory data.
One is to implicitly issue an instruction that does nothing. The other
is to have a "valid" flag that accompanies each instruction. They're
essentially identical in principle, and we should do whichever is
simpler and smaller (probably the former, so I'm going with that until
convinced otherwise).

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
