On 2009-10-16, Timothy Normand Miller wrote:
> On Fri, Oct 16, 2009 at 2:55 AM, Petter Urkedal <[email protected]> wrote:
> > On 2009-10-15, Timothy Normand Miller wrote:
> >> I've drawn this to illustrate the design of our shader engine:
> >>
> >> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
> >>
> >> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
> >> they required 6 pipeline stages.  Andre did something similar, but I
> >> don't know what his results were.  Also, Andre mentioned that there
> >> are sections of the pipelines for those units that are similar enough
> >> that they could be combined.  Putting this together, we have a 9-stage
> >> pipeline, which would imply 10 time slots for 10 (or more) tasks
> >> assigned to the engine at once.  It's 10 because 1/2 a BRAM stores the
> >> regfile for one task.
> >
> > If we have one "free" thread, shouldn't we instead add a 1-slot queue in
> > front of the pipeline, so that the pipeline still runs at full capacity
> > while we have one pending load?
>
> Well, basically, we can try to just hop over the stalled thread,
> selecting another one that is legal to issue at that time.  We will
> have a scoreboard of which threads are in which pipeline stages (so
> for instance we know which regfile to store in on writeback or the
> reader tag when issuing loads).  We can easily use this to predict
> which other thread is going to be in write-back at the time that any
> thread we might issue will be in write-back.
>
> This may be another premature optimization that we'll want to keep on
> the to-do list.
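As a quick sanity check of the hop-over idea, here is a minimal software model (purely hypothetical Python, not the actual HDL; the class and parameter names are mine) of a scoreboard that tracks which thread sits in which pipeline stage and lets the issue logic skip a stalled thread in favour of another one that is legal to issue this cycle:

```python
PIPELINE_DEPTH = 9  # fetch + decode + ALU stages + write-back, per the thread above

class Scoreboard:
    def __init__(self, depth=PIPELINE_DEPTH):
        # stages[i] holds the thread id currently occupying stage i, or None.
        self.stages = [None] * depth

    def advance(self):
        """Shift every in-flight thread one stage down the pipeline.
        The thread leaving the last stage has completed write-back."""
        retired = self.stages[-1]
        self.stages = [None] + self.stages[:-1]
        return retired

    def issue(self, ready_threads, stalled):
        """Pick the first ready thread that is neither stalled (e.g.
        waiting on a load) nor already in flight; put it in stage 0.
        Returns the issued thread id, or None if we take a bubble."""
        in_flight = {t for t in self.stages if t is not None}
        for tid in ready_threads:
            if tid not in stalled and tid not in in_flight:
                self.stages[0] = tid
                return tid
        return None
```

For example, with thread 0 stalled on a load, the selector issues thread 1 instead, and on the next cycle issues thread 2 (0 still stalled, 1 already in flight), so the pipeline keeps running at full capacity as long as any legal thread exists.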
I think we can do this more easily by adding to the register file the
capability of putting a single write on hold until a non-read cycle
occurs.  On that note, can we write simultaneously to both BRAM ports,
and if so, is the case where the addresses are the same well defined,
i.e. does one port take precedence?  That may simplify the logic
slightly.

Another thing: the fetch stage does not seem to depend on the
write-back stage, so I think a task can be re-scheduled into the fetch
stage in the same cycle it enters the write-back stage.  That shortens
the cycle to 8 stages, allowing us to run with only 8 tasks if needed.

Note that if we combine these two solutions, the register file must
check each incoming read against the pending write, if one is present.
This still seems a lot simpler than the register forwarding in HQ.
The troublesome case hits the second of two BRAM-sibling threads
scheduled on consecutive cycles: while the first thread is in decode,
the second thread is in fetch and write-back simultaneously, so its
write-back is delayed.  On the next cycle the second thread is in
decode, potentially requesting the value of the pending write.

+--> [fetch]
|    [decode +-->wb]
|    [alu]   |
|    [alu]   |
|    [alu]   |
|    [alu]   |
|    [alu]   |
+--- [alu]---+

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
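To make the hold-buffer idea concrete, here is a hypothetical Python model (names and sizes are my own invention, not the real regfile) of a register file whose port is shared between reads and writes: a write colliding with a read is parked in a one-deep buffer, drains on the next non-read cycle, and any read in the meantime is checked against the parked write so the reader sees the pending value — the check described in the paragraph on combining the two solutions:

```python
class RegFile:
    def __init__(self, size=256):
        self.mem = [0] * size
        self.pending = None  # (addr, value) of the single write on hold

    def cycle(self, read_addr=None, write=None):
        """Model one cycle on the shared port.  `write` is (addr, value).
        Returns the read data, or None on a non-read cycle."""
        if read_addr is not None:
            if write is not None:
                # Port is busy with the read: park the write.  The hold
                # buffer is one entry deep, matching the single-delayed-
                # write-back case discussed above.
                assert self.pending is None, "hold buffer already full"
                self.pending = write
            # Forward from the hold buffer on an address match, so the
            # reader never observes the stale value.
            if self.pending is not None and self.pending[0] == read_addr:
                return self.pending[1]
            return self.mem[read_addr]
        # Non-read cycle: drain the parked write, then apply any new one
        # (in hardware this would be the second BRAM port, assuming the
        # dual-write question above is answered favourably).
        if self.pending is not None:
            addr, val = self.pending
            self.mem[addr] = val
            self.pending = None
        if write is not None:
            addr, val = write
            self.mem[addr] = val
        return None
```

In the BRAM-sibling scenario, the delayed write-back is parked during the second thread's fetch, and if that thread's decode then reads the same register, the forwarding path returns the pending value rather than the stale contents of the BRAM.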
