On Sat, Oct 17, 2009 at 9:07 AM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-16, Timothy Normand Miller wrote:
>> On Fri, Oct 16, 2009 at 2:55 AM, Petter Urkedal <[email protected]> wrote:
>> > On 2009-10-15, Timothy Normand Miller wrote:
>> >> I've drawn this to illustrate the design of our shader engine:
>> >>
>> >> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
>> >>
>> >> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
>> >> they required 6 pipeline stages.  Andre did something similar, but I
>> >> don't know what his results were.  Also, Andre mentioned that there
>> >> are sections of the pipelines for those units that are similar enough
>> >> that they could be combined.  Putting this together, we have a 9-stage
>> >> pipeline, which would imply 10 time slots for 10 (or more) tasks
>> >> assigned to the engine at once.  It's 10 because 1/2 a BRAM stores the
>> >> regfile for one task.
>> >
>> > If we have one "free" thread, shouldn't we instead add a 1-slot queue in
>> > front of the pipeline, so that the pipeline still runs at full capacity
>> > while we have one pending load?
>>
>> Well, basically, we can try to just hop over the stalled thread,
>> selecting another one that is legal to issue at that time.  We will
>> have a scoreboard of which threads are in which pipeline stages (so
>> for instance we know which regfile to store in on writeback or the
>> reader tag when issuing loads).  We can easily use this to predict
>> which other thread is going to be in write-back at the time that any
>> thread we might issue will be in write-back.
>>
>> This may be another premature optimization that we'll want to keep on
>> the to-do list.
>
> I think we can do this easier by adding to the register file the
> capability of putting a single write on hold until a non-read cycle
> occurs.  On that note, can we write simultaneously to both BRAM ports,
> and if so, is the case where the addresses are the same well defined,
> meaning that one port takes precedence?  That may simplify the logic
> slightly.

I don't think it simplifies things, but it's not a bad idea.  The fact
is, we can write to both ports of any BRAM except the one being read
from.

> Another thing.  The fetch stage does not seem to depend on the
> write-back stage, so I think the tasks can be re-scheduled to the fetch
> stage at the same cycle they enter the write-back stage.  Thus, we can
> reduce the cycle to 8 stages, allowing us to use only 8 tasks if needed.

This is what I get:

stage   c0      c1      c2      c3      c4      c5      c6      c7      c8
0       2       3       4       5       6       7       0       1
1       1       2       3       4       5       6       7       0
2       0       1       2       3       4       5       6       7
3       7       0       1       2       3       4       5       6
4       6       7       0       1       2       3       4       5
5       5       6       7       0       1       2       3       4
6       4       5       6       7       0       1       2       3
7       3       4       5       6       7       0       1       2
8       2       3       4       5       6       7       0       1
wb      1       2       3       4       5       6       7       0
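To sanity-check that table, here's a quick simulation sketch (Python rather than HDL, and the stage indexing is my own): the table says stage s at cycle c holds thread (c + 2 - s) mod 8, and the interesting property is that each thread's decode cycle always lands on the same cycle as its own previous instruction's write-back, which is exactly the collision discussed below.

```python
# Sanity check of the schedule above: 8 threads rotating through
# stages 0..8 plus write-back (10 slots).  Per the table, stage s at
# cycle c holds thread (c + 2 - s) mod 8.

N_THREADS = 8
WB_STAGE = 9                        # stages 0..8, then write-back

def thread_in_stage(stage, cycle):
    """Thread occupying `stage` at `cycle`, per the table above."""
    return (cycle + 2 - stage) % N_THREADS

# Each thread is re-fetched every 8 cycles, but its previous
# instruction only reaches write-back 9 cycles after fetch, so a
# thread's decode (stage 1 here) always coincides with its own
# write-back -- hence the need for forwarding:
for t in range(N_THREADS):
    decode_cycles = [c for c in range(24) if thread_in_stage(1, c) == t]
    wb_cycles = [c for c in range(24) if thread_in_stage(WB_STAGE, c) == t]
    assert decode_cycles == wb_cycles
```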

As you say, we would need a forwarding mechanism in decode to pass
through any collisions between read and write-back.  The only problem
I see is that the multiplexing would have to come after the registered
BRAM output, putting it into the critical path of stage 2.  We would
do the address comparison in stage 1, generating a single bit to
control the MUX in stage 2, so it's not much logic, but it would still
be nice to avoid it entirely, especially if we have future plans to
assign more contexts to a shader.  (Premature optimization is bad, but
avoiding an optimization that might have to be undone later doesn't
seem so bad to me.)
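To make the mux placement concrete, here's a behavioral sketch (Python, not HDL; the function names and the two-stage split are my own illustration of the scheme above):

```python
# Behavioral sketch of the read/write-back forwarding discussed above.
# Stage 1 compares addresses and registers a single select bit; stage 2
# muxes between the registered BRAM output and the write-back data.

def stage1_compare(read_addr, wb_addr):
    # One comparator in stage 1 -> a single registered mux-select bit.
    return read_addr == wb_addr

def stage2_mux(forward, bram_out, wb_data):
    # The mux sits after the registered BRAM output, so it lands on
    # the critical path of stage 2.
    return wb_data if forward else bram_out

# Example: a thread reads r5 in the same cycle its write-back stores r5.
regfile = {5: 100}                  # stale value still in the BRAM
wb_addr, wb_data = 5, 123           # value being written back this cycle
fwd = stage1_compare(5, wb_addr)
value = stage2_mux(fwd, regfile[5], wb_data)
assert value == 123                 # the reader sees the fresh value
```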

Also, in any case, the mapping from threads to BRAMs has to be chosen
to avoid collisions, unless we do as you suggested above AND make sure
that the writeback always happens before the same thread gets into
decode again, even when the writeback is delayed by a collision.  I
suspect that a buffer of no more than two pending writes, plus the
forwarding mechanism, would be enough to make this work.
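As a rough model of that guess (the two-entry buffer plus forwarding), here's a Python sketch; the interface and the drain-on-non-read-cycle policy are invented for illustration, following Petter's suggestion above:

```python
# Sketch of a register file that holds colliding writes in a small
# pending buffer (at most two, per the guess above), drains one on any
# non-read cycle, and forwards pending values to colliding reads.

class RegFile:
    def __init__(self, size=32):
        self.mem = [0] * size
        self.pending = []               # queued (addr, data) writes

    def cycle(self, read_addr=None, write=None):
        """One cycle: optional read address, optional (addr, data) write."""
        if write is not None:
            assert len(self.pending) < 2, "two pending writes should suffice"
            self.pending.append(write)
        if read_addr is None:
            if self.pending:            # non-read cycle: drain one write
                addr, data = self.pending.pop(0)
                self.mem[addr] = data
            return None
        # Forward from the newest matching pending write, else read BRAM.
        for addr, data in reversed(self.pending):
            if addr == read_addr:
                return data
        return self.mem[read_addr]

rf = RegFile()
assert rf.cycle(read_addr=0, write=(3, 7)) == 0   # write held, read unaffected
assert rf.cycle(read_addr=3) == 7                 # forwarded from pending
rf.cycle()                                        # non-read cycle drains it
assert rf.mem[3] == 7
```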

> Note if we combine these two solutions, then the register file must
> check the incoming request against the pending write if present.  This
> still seems a lot simpler than the register-forwarding in HQ.  The case
> happens to the second thread of two BRAM-siblings which are scheduled on
> consecutive cycles.  When the first thread is in decode, the second
> thread is in fetch and write-back, thus the write-back will be delayed.
> On the next cycle, the second thread is in decode, potentially
> requesting the value of the pending write.
>
>  +--> [fetch]
>  |    [decode  +-->wb]
>  |    [alu]    |
>  |    [alu]    |
>  |    [alu]    |
>  |    [alu]    |
>  |    [alu]    |
>  +--- [alu]----+

For the first iteration, with fixed slots, all we have to do is make
sure that the mapping from thread numbers to BRAMs avoids collisions.
Later, we'll need a mechanism like what we've discussed here.
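For instance, with a hypothetical mapping (pairing thread t with t+4 on the same BRAM, and using the fetch slots from my table above; both the pairing and the 2-cycle criterion are my own illustration of the constraint Petter described):

```python
# Hypothetical thread-to-BRAM mapping: threads t and t+4 share BRAM
# t mod 4 (half a BRAM per regfile).  From the schedule above, thread t
# is fetched at cycles c = (t - 2) mod 8; check that BRAM-siblings are
# never scheduled on consecutive cycles.

def bram_of(thread):
    return thread % 4               # t and t+4 are siblings

for b in range(4):
    siblings = [t for t in range(8) if bram_of(t) == b]
    slots = [(t - 2) % 8 for t in siblings]
    gap = min((slots[0] - slots[1]) % 8, (slots[1] - slots[0]) % 8)
    assert gap >= 2, "siblings on consecutive cycles would collide"
```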

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
