On 2009-10-17, Timothy Normand Miller wrote:
> On Sat, Oct 17, 2009 at 9:07 AM, Petter Urkedal <[email protected]> wrote:
> > On 2009-10-16, Timothy Normand Miller wrote:
> >> On Fri, Oct 16, 2009 at 2:55 AM, Petter Urkedal <[email protected]> wrote:
> >> > On 2009-10-15, Timothy Normand Miller wrote:
> >> >> I've drawn this to illustrate the design of our shader engine:
> >> >>
> >> >> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
> >> >>
> >> >> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
> >> >> they required 6 pipeline stages.  Andre did something similar, but I
> >> >> don't know what his results were.  Also, Andre mentioned that there
> >> >> are sections of the pipelines for those units that are similar enough
> >> >> that they could be combined.  Putting this together, we have a 9-stage
> >> >> pipeline, which would imply 10 time slots for 10 (or more) tasks
> >> >> assigned to the engine at once.  It's 10 because 1/2 a BRAM stores the
> >> >> regfile for one task.
> >> >
> >> > If we have one "free" thread, shouldn't we instead add a 1-slot queue in
> >> > front of the pipeline, so that the pipeline still runs at full capacity
> >> > while we have one pending load?
> >>
> >> Well, basically, we can try to just hop over the stalled thread,
> >> selecting another one that is legal to issue at that time.  We will
> >> have a scoreboard of which threads are in which pipeline stages (so
> >> for instance we know which regfile to store in on writeback or the
> >> reader tag when issuing loads).  We can easily use this to predict
> >> which other thread is going to be in write-back at the time that any
> >> thread we might issue will be in write-back.
> >>
> >> This may be another premature optimization that we'll want to keep on
> >> the to-do list.
> >
> > I think we can do this easier by adding to the register file the
> > capability of putting a single write on hold until a non-read cycle
> > occurs.  On that note, can we write simultaneously to both BRAM ports,
> > and if so, is the case where the addresses are the same well defined,
> > meaning that one port takes precedence?  That may simplify the logic
> > slightly.
> 
> I don't think it simplifies things, but it's not a bad idea.  The fact
> is, we can write to both ports of any BRAM except the one being read
> from.

I haven't reflected on how much logic is associated with the scoreboard,
but note that only a single hold state is needed, so we're talking about
32 + 8 registers and some fairly simple logic.
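
To make it concrete, here is a rough behavioural sketch in Python (not
RTL; the class and signal names are mine) of a register-file BRAM with a
single hold slot: a write-back that collides with a read is parked in a
single holding register (the 32 + 8 bits of state above) and drains on
the next cycle the port is free.

# Behavioural sketch only; names and sizes are placeholders.
class RegfileWithHold:
    def __init__(self, nregs=64):
        self.mem = [0] * nregs   # the BRAM shared by two sibling tasks
        self.hold = None         # at most one parked write: (regno, data)

    def clock(self, read_reg=None, wb=None):
        """One cycle: an optional read and an optional write-back (regno, data)."""
        # Drain the parked write whenever no read blocks this BRAM.
        if self.hold is not None and read_reg is None:
            regno, data = self.hold
            self.mem[regno] = data
            self.hold = None

        # Accept this cycle's write-back, parking it on a collision.
        if wb is not None:
            if read_reg is None:
                self.mem[wb[0]] = wb[1]
            else:
                # The schedule has to guarantee one slot is enough.
                assert self.hold is None
                self.hold = wb

        # Serve the read, bypassing from the hold slot if it matches
        # (the check against the pending write mentioned further down).
        if read_reg is not None:
            if self.hold is not None and self.hold[0] == read_reg:
                return self.hold[1]
            return self.mem[read_reg]
        return None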

> > Another thing.  The fetch stage does not seem to depend on the
> > write-back stage, so I think the tasks can be re-scheduled to the fetch
> > stage at the same cycle they enter the write-back stage.  Thus, we can
> > reduce the cycle to 8 stages, allowing us to use only 8 tasks if needed.
> 
> This is what I get:
> 
> stage   (thread number in each stage, over eight consecutive cycles)
> 0     2       3       4       5       6       7       0       1
> 1     1       2       3       4       5       6       7       0
> 2     0       1       2       3       4       5       6       7
> 3     7       0       1       2       3       4       5       6
> 4     6       7       0       1       2       3       4       5
> 5     5       6       7       0       1       2       3       4
> 6     4       5       6       7       0       1       2       3
> 7     3       4       5       6       7       0       1       2
> 8     2       3       4       5       6       7       0       1
> wb    1       2       3       4       5       6       7       0
> 
> As you say, we would have to have a forwarding mechanism in decode to
> pass through any collisions between read and writeback.  The only
> problem I see is that the multiplexing would have to be after the
> registered BRAM, putting it into the critical path of stage 2.  We
> would do the comparison in stage 1, generating a single bit to control
> the MUX in stage 2, so it's not much, but it would still be nice to
> avoid this extra logic entirely, especially if we have future plans to
> assign more contexts to a shader.  (Prematurely optimizing is bad.
> But avoiding an optimization that might be undone later doesn't seem
> so bad to me.)
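
Just as a cross-check on the rotation (a throwaway script of my own,
nothing to do with the RTL): with 8 threads and the 10 slots above, the
thread sitting in stage s at cycle c is simply (c - s) mod 8, so every
thread occupies two stages at once (stage 0 together with stage 8, and
stage 1 together with wb), which is where the read/write-back collisions
you mention come from.

STAGES = [str(s) for s in range(9)] + ["wb"]

for s, name in enumerate(STAGES):
    # The +2 only lines the output up with the first column of the table.
    row = [(c + 2 - s) % 8 for c in range(8)]
    print(f"{name:>5}  " + "  ".join(str(t) for t in row))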

You're right that the MUX between the holding register and the BRAM
spills into the next stage.  On the other hand, I believe that stage
mostly needs to do a comparison of exponents, since we can push the
integer math one stage down if necessary.  That is assuming we can't
also fit the mantissa shift into the same stage without creating a
timing bottleneck.
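
Roughly what I picture for that forward path, as a sketch rather than
RTL (the function and signal names are mine): the address compare runs
in stage 1, a single select bit and the write-back data are registered,
and stage 2 only adds the 2:1 MUX after the BRAM's output register.

def forward_step(read_reg, wb_reg, wb_data, bram, regs):
    """Advance the two-stage read/forward path by one cycle.

    regs holds what was latched at the end of the previous cycle:
    (sel_q, fwd_q, bram_q).  Returns (operand, regs') where operand
    is what the consumer in stage 2 sees this cycle.
    """
    sel_q, fwd_q, bram_q = regs

    # Stage 2: one MUX, controlled by last cycle's registered compare.
    operand = fwd_q if sel_q else bram_q

    # Stage 1: compare this cycle's read against the colliding
    # write-back, and latch the BRAM output register for next cycle.
    sel_d = wb_reg is not None and read_reg == wb_reg
    regs = (sel_d, wb_data, bram[read_reg])

    return operand, regs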

> Also, in any case, how threads map to BRAMs has to be chosen to avoid
> collision unless we do as you suggested above AND make sure that the
> writeback always happens before the same thread gets into decode
> again, even when the writeback is delayed due to collision.  I suspect
> that we wouldn't need a buffer of more than two pending writes and the
> forwarding mechanism to make this work.

We only need one pending write, as I describe below.  A simpler way to
see it is that there are only two writers; the one that is ahead of the
other (after cutting the cycle at the longest separation) will always
succeed with the write-back.
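
If it helps, here is a throwaway script (mine; the pairing of threads
onto BRAMs as siblings 0/1, 2/3, ... is only an assumption) that walks
the 8-slot loop quoted below, with fetch and write-back carrying the
same thread and decode one cycle behind, and confirms that the single
hold slot never overflows:

N = 8

def bram(t):
    return t // 2                    # assumed sibling mapping

held = None                          # BRAM number of the single parked write

for c in range(1000):
    decode = (c - 1) % N             # thread reading its regfile BRAM
    wb = c % N                       # thread wanting to write back

    # The parked write drains as soon as its BRAM is free of reads.
    if held is not None and held != bram(decode):
        held = None

    # A write-back colliding with the read is parked instead.
    if bram(wb) == bram(decode):
        assert held is None, "a second hold slot would be needed"
        held = bram(wb)

print("ok: one hold slot was always enough")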

> > Note if we combine these two solutions, then the register file must
> > check the incoming request against the pending write if present.  This
> > still seems a lot simpler than the register-forwarding in HQ.  The case
> > happens to the second thread of two BRAM-siblings which are scheduled on
> > consecutive cycles.  When the first thread is in decode, the second
> > thread is in fetch and write-back, thus the write-back will be delayed.
> > On the next cycle, the second thread is in decode, potentially
> > requesting the value of the pending write.
> >
> >  +--> [fetch]
> >  |    [decode  +-->wb]
> >  |    [alu]    |
> >  |    [alu]    |
> >  |    [alu]    |
> >  |    [alu]    |
> >  |    [alu]    |
> >  +--- [alu]----+
> 
> For the first iteration, with fixed slots, all we have to do is make
> sure that the mapping from thread numbers to BRAMs avoids collision.
> Later, we'll need a mechanism like what we've discussed here.

I'd go for the commutable threads right away, given the slight overall
effect on the design, and since non-commutable threads would mean we
have to fix up the order of threads coming out of load.
