On 2009-10-16, Timothy Normand Miller wrote:
> On Fri, Oct 16, 2009 at 2:55 AM, Petter Urkedal <[email protected]> wrote:
> > On 2009-10-15, Timothy Normand Miller wrote:
> >> I've drawn this to illustrate the design of our shader engine:
> >>
> >> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
> >>
> >> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
> >> they required 6 pipeline stages.  Andre did something similar, but I
> >> don't know what his results were.  Also, Andre mentioned that there
> >> are sections of the pipelines for those units that are similar enough
> >> that they could be combined.  Putting this together, we have a 9-stage
> >> pipeline, which would imply 10 time slots for 10 (or more) tasks
> >> assigned to the engine at once.  It's 10 because 1/2 a BRAM stores the
> >> regfile for one task.
> >
> > If we have one "free" thread, shouldn't we instead add a 1-slot queue in
> > front of the pipeline, so that the pipeline still runs at full capacity
> > while we have one pending load?
>
> Well, basically, we can try to just hop over the stalled thread,
> selecting another one that is legal to issue at that time.  We will
> have a scoreboard of which threads are in which pipeline stages (so
> for instance we know which regfile to store in on writeback or the
> reader tag when issuing loads).  We can easily use this to predict
> which other thread is going to be in write-back at the time that any
> thread we might issue will be in write-back.
>
> This may be another premature optimization that we'll want to keep on
> the to-do list.
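As a quick sanity check of the hop-over idea, here is a minimal software model (purely hypothetical Python, not the actual HDL; the class and parameter names are mine) of a scoreboard that tracks which thread sits in which pipeline stage and lets the issue logic skip a stalled thread in favour of another one that is legal to issue this cycle:

```python
PIPELINE_DEPTH = 9  # fetch + decode + ALU stages + write-back, per the thread above

class Scoreboard:
    def __init__(self, depth=PIPELINE_DEPTH):
        # stages[i] holds the thread id currently occupying stage i, or None.
        self.stages = [None] * depth

    def advance(self):
        """Shift every in-flight thread one stage down the pipeline.
        The thread leaving the last stage has completed write-back."""
        retired = self.stages[-1]
        self.stages = [None] + self.stages[:-1]
        return retired

    def issue(self, ready_threads, stalled):
        """Pick the first ready thread that is neither stalled (e.g.
        waiting on a load) nor already in flight; put it in stage 0.
        Returns the issued thread id, or None if we take a bubble."""
        in_flight = {t for t in self.stages if t is not None}
        for tid in ready_threads:
            if tid not in stalled and tid not in in_flight:
                self.stages[0] = tid
                return tid
        return None
```

For example, with thread 0 stalled on a load, the selector issues thread 1 instead, and on the next cycle issues thread 2 (0 still stalled, 1 already in flight), so the pipeline keeps running at full capacity as long as any legal thread exists.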
I think we can do this more easily by adding to the register file the
capability of putting a single write on hold until a non-read cycle
occurs.  On that note, can we write simultaneously to both BRAM ports,
and if so, is the case where the addresses are the same well defined,
i.e. does one port take precedence?  That may simplify the logic
slightly.

Another thing: the fetch stage does not seem to depend on the
write-back stage, so I think a task can be re-scheduled into the fetch
stage in the same cycle it enters the write-back stage.  That shortens
the cycle to 8 stages, allowing us to run with only 8 tasks if needed.

Note that if we combine these two solutions, the register file must
check each incoming read against the pending write, if one is present.
This still seems a lot simpler than the register forwarding in HQ.
The troublesome case hits the second of two BRAM-sibling threads
scheduled on consecutive cycles: while the first thread is in decode,
the second thread is in fetch and write-back simultaneously, so its
write-back is delayed.  On the next cycle the second thread is in
decode, potentially requesting the value of the pending write.

+--> [fetch]
|    [decode +-->wb]
|    [alu]   |
|    [alu]   |
|    [alu]   |
|    [alu]   |
|    [alu]   |
+--- [alu]---+

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
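To make the hold-buffer idea concrete, here is a hypothetical Python model (names and sizes are my own invention, not the real regfile) of a register file whose port is shared between reads and writes: a write colliding with a read is parked in a one-deep buffer, drains on the next non-read cycle, and any read in the meantime is checked against the parked write so the reader sees the pending value — the check described in the paragraph on combining the two solutions:

```python
class RegFile:
    def __init__(self, size=256):
        self.mem = [0] * size
        self.pending = None  # (addr, value) of the single write on hold

    def cycle(self, read_addr=None, write=None):
        """Model one cycle on the shared port.  `write` is (addr, value).
        Returns the read data, or None on a non-read cycle."""
        if read_addr is not None:
            if write is not None:
                # Port is busy with the read: park the write.  The hold
                # buffer is one entry deep, matching the single-delayed-
                # write-back case discussed above.
                assert self.pending is None, "hold buffer already full"
                self.pending = write
            # Forward from the hold buffer on an address match, so the
            # reader never observes the stale value.
            if self.pending is not None and self.pending[0] == read_addr:
                return self.pending[1]
            return self.mem[read_addr]
        # Non-read cycle: drain the parked write, then apply any new one
        # (in hardware this would be the second BRAM port, assuming the
        # dual-write question above is answered favourably).
        if self.pending is not None:
            addr, val = self.pending
            self.mem[addr] = val
            self.pending = None
        if write is not None:
            addr, val = write
            self.mem[addr] = val
        return None
```

In the BRAM-sibling scenario, the delayed write-back is parked during the second thread's fetch, and if that thread's decode then reads the same register, the forwarding path returns the pending value rather than the stale contents of the BRAM.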
