On 2009-10-11, Timothy Normand Miller wrote:
> If you can think of a way to do this, propose it.  But I think we
> really should stick with a more traditional, well-understood
> architecture for the first pass.  Something too exotic could very well
> be ignored as too difficult by those whose help we need to write
> compilers for this.
 
[...]

> And for this optimization, I'm expecting we'll have to rely on the
> existing GCC infrastructure.  We have to fight our urges to be overly
> clever with this architecture when it complicates things for others,
> because we have a very practical goal here:  Design a GPU that works.
> Remember, we can always revise and revise.  Nothing is set in stone.
> Designing something mundane in and of itself will be educational for
> us.  So don't underestimate the value of reinventing the wheel.

[...]

> This shader design will be FPGA-only for a long time.  Don't make the
> microcode programmable for our sake.  Only make something programmable
> if it'll be very likely to (a) make it easier to write the compiler or
> less importantly (b) make the tasks run faster.

I hear you ;-)
 
> > We should also not completely fix the number of threads per pipeline
> > before we design it.  If we can save one stage, so we have 7 instead of
> > 8 stages, do we have an argument for rounding up to 8?  If not, we can
> > go with 7 threads and thus fit more pipelines.
> 
> If we assign N threads to 1 pipeline, we need one icache, one of each
> functional unit, but N register files.  If we allocate 256 registers
> for a thread, then we'll want to assign threads in even numbers since
> the BRAMs hold 512 words.

If we end up with an odd number of stages, we might manage to share a
BRAM between two pipelines if we can use both ports, or if we can
arrange so that the two accesses never fall on the same cycle.
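To make the arithmetic concrete, here is a little Python sketch of the
BRAM budget (the 256-registers-per-thread and 512-word-BRAM figures are
from above; the helper function and its name are just made up for
illustration):

```python
# Register-file packing: each thread needs 256 registers and a block RAM
# holds 512 words, so one BRAM serves exactly two threads.  With an odd
# thread count per pipeline, half a BRAM is stranded unless we can share
# it with a neighbouring pipeline (via the second port, or by scheduling
# the accesses onto different cycles).  Numbers assumed per the thread.
REGS_PER_THREAD = 256
WORDS_PER_BRAM = 512
THREADS_PER_BRAM = WORDS_PER_BRAM // REGS_PER_THREAD   # = 2

def brams_needed(threads_per_pipeline, pipelines, share_across_pipelines):
    """Count BRAMs for the register files (hypothetical helper)."""
    total_threads = threads_per_pipeline * pipelines
    if share_across_pipelines:
        # pack threads from different pipelines into the same BRAM
        return -(-total_threads // THREADS_PER_BRAM)   # ceiling division
    # otherwise each pipeline rounds its register storage up on its own
    return pipelines * -(-threads_per_pipeline // THREADS_PER_BRAM)

print(brams_needed(7, 2, False))  # 8 BRAMs without sharing
print(brams_needed(7, 2, True))   # 7 BRAMs if the odd halves are shared
```

So with 7 threads per pipeline and 2 pipelines, sharing the odd half-BRAM
saves one BRAM per pair of pipelines.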

> >> > It may be tempting to add one or
> >> > two extra threads per ALU to keep the ALU busy, but due to the cost and
> >> > the low frequency of loads, it may be better to send a "phantom" down
> >> > the ALU for the thread doing the load.  The result of the load can be
> >> > fetched back via a short return queue on each ALU.  This could be just
> >> > one or two slots if we allow stalling on rare cases.  As soon as a
> >> > "phantom" comes out of the ALU, a real thread is dequeued and passed
> >> > down in its place.
> >>
> >> I'm not sure, but you may be saying the same thing.  :)
> >
> > No.  The load instruction does not prevent the execution of the next
> > thread.  Say thread I issues a load and, for simplicity, we only
> > have 4 stages:
> >
> >  (  I0,   H0,   G0,   F0)
> >  (  F1, noop,   H0,   G0)
> >  (  G1,   F1, noop,   H0)
> >  (  H1,   G1,   F1, noop)
> >  (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
> >  (  F2, noop,   H1,   G1)  <-- result for thread I ready
> >  (  G2,   F2, noop,   H1)
> >  (  H2,   G2,   F2, noop)
> >  (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled
> >
> > It should be noted that if there are several loads in action for a
> > pipeline, we can plug the first noop which is about to re-enter the
> > pipeline with the first thread from the pipeline which finishes the
> > load.
> 
> Those noops could go on for a long time.  The memory latency is
> unpredictable.

Yes, but unless we are willing to spend an extra thread context, there
is no way we can utilise the pipeline better than (N - 1)/N while a
load is in flight.

> Besides, this is moot if we dispense with the MIMD.
> I'm assuming your columns are different threads.  Getting rid of MIMD
> means that we just round-robin N threads on one pipeline.  We're not
> gluing together more than one pipeline.

The above scheme is a way to round-robin N threads on an N-stage
pipeline with the ability to remove threads and add them back.  I'm just
spelling it out since I'm not sufficiently familiar with hardware design
to know how all the details are worked out.  We have probably had the
same design in mind from the start, and I was just confused about "NOOPs
get issued [ins: on every N cycles] until the read return queue is no
longer empty".
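For what it's worth, here is a small Python simulation of the scheme as
I understand it.  The thread names, the 4 stages, and the resulting
trace are taken from the example above; the 4-cycle load latency is an
assumption picked so the result comes back on the cycle marked in the
trace:

```python
from collections import deque

STAGES = 4            # pipeline depth == number of resident thread contexts
LOAD_LATENCY = 4      # assumed: result returns this many cycles after issue
NOOP = 'noop'

pipeline = ['I0', 'H0', 'G0', 'F0']    # stage 0 (issue) .. stage 3 (last)
pc = {'F': 1, 'G': 1, 'H': 1, 'I': 1}  # next instruction number per thread
load_instrs = {'I0'}                   # instructions that are loads
returns = deque()                      # load-return queue: (ready_cycle, thread)

trace = [tuple(pipeline)]
for cycle in range(1, 9):
    exiting = pipeline[-1]             # thread leaving the last stage
    if pipeline[0] in load_instrs:
        # the load is issued; the thread parks and a phantom flows on
        returns.append((cycle + LOAD_LATENCY, pipeline[0][0]))
        pipeline[0] = NOOP
    pipeline = [None] + pipeline[:-1]  # everything moves down one stage
    if exiting != NOOP:
        t = exiting[0]                 # a real thread re-enters round-robin
        pipeline[0] = t + str(pc[t])
        pc[t] += 1
    elif returns and returns[0][0] <= cycle:
        _, t = returns.popleft()       # plug the phantom with a finished load
        pipeline[0] = t + str(pc[t])
        pc[t] += 1
    else:
        pipeline[0] = NOOP             # return queue still empty
    trace.append(tuple(pipeline))

for row in trace:
    print(row)
```

Running it reproduces the trace above: the noop recirculates once while
the return queue is empty, and I1 re-enters exactly when the next
phantom drains out.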
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
