On Mon, Oct 12, 2009 at 7:01 PM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-12, Timothy Normand Miller wrote:
>> >> Besides, this is moot if we dispense with the MIMD.
>> >> I'm assuming your columns are different threads.  Getting rid of MIMD
>> >> means that we just round-robin N threads on one pipeline.  We're not
>> >> gluing together more than one pipeline.
>> >
>> > The above scheme is a way to round-robin N threads on an N-wide
>> > pipeline with the ability to remove and add back threads.  I'm just
>> > spelling it out since I'm not sufficiently familiar with hardware design
>> > to know how all the details are worked out.  We have probably had the
>> > same design in mind from the start, and I was just confused about "NOOPs
>> > get issued [ins: on every N cycles] until the read return queue is no
>> > longer empty".
>>
>> Splitting sets is easy.  You just detect that branch instructions are
>> taking different directions.  But recombining them requires a lot more
>> logic.  Plus, I'm becoming more and more convinced that we should drop
>> it and go for 1-wide pipeline.  (Although I'm open to allowing two to
>> share the same icache since BRAMs are dual-ported.)
>
> If we go with N independent threads as we have discussed, then no hard
> recombination is needed.  Assuming 1-wide means 1 thread per pipeline, I
> suspect this will make the design more difficult.  The pipeline will be
> deeper than that of HQ, so won't this introduce a lot of register
> forwarding and potentially tight timing constraints in order to keep the
> instruction semantics sane?  On the other hand, if we need to save BRAM,
> that's an excellent argument for reducing the number of threads per
> pipeline.

One wide, N deep.  That is, each shader will be assigned N threads,
where N is greater than or equal to the number of pipeline stages,
and instructions for those threads will be issued round-robin, one
thread per cycle.

There's some confusion around the word "wide", since an earlier
iteration of the design issued instructions from multiple threads to
multiple pipelines on the SAME cycle.

In the design we're proposing, we have no need for forwarding, branch
prediction, branch delay slots, or any of those things.  For any
given thread, an instruction is issued only once every 8 (or so)
cycles, so its previous instruction has already written back before
the next one enters the pipeline.
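
To make the issue scheme concrete, here's a rough C model of it.
This is only a sketch under the assumptions above; the names, the
nop encoding, and the 8-slot depth are placeholders, not the actual
design:

    #include <stdint.h>
    #include <stdbool.h>

    #define N_SLOTS 8            /* >= number of pipeline stages */
    #define NOP     0x00000000u  /* placeholder nop encoding */

    struct thread_state {
        uint32_t pc;             /* next instruction address */
        bool     stalled;        /* waiting on a memory read */
    };

    static struct thread_state threads[N_SLOTS];

    /* Stand-in for an icache read; the real design would index
     * a BRAM. */
    static uint32_t fetch(uint32_t pc)
    {
        return pc;               /* dummy instruction word */
    }

    /* Called once per clock.  The cycle counter selects the time
     * slot, so each thread issues at most once every N_SLOTS
     * cycles; a stalled thread's slot is filled with a nop. */
    uint32_t issue(unsigned cycle)
    {
        struct thread_state *t = &threads[cycle % N_SLOTS];

        if (t->stalled)
            return NOP;

        return fetch(t->pc++);
    }

Note that the cycle counter is the only scheduling state the issue
logic needs, which is part of what makes this cheaper than a
forwarded single-thread pipeline.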

>
>> There are two ways to handle the stall when waiting on memory data.
>> One is to implicitly issue an instruction that does nothing.  The
>> other is to have a "valid" flag that accompanies each instruction.
>> They're essentially identical in principle, and we should do whichever
>> is simpler and smaller (which is probably the former, so I'm going
>> with that until convinced otherwise).
>
> There is another fine detail which I mentioned.  When there are several
> pending loads for a certain pipeline, we may allow the pipeline to
> reschedule the first to finish in place of the first hole which is about
> to reenter the pipeline.  If we use this minor optimisation, then it's
> most natural to pass down a "noop".

I don't understand what you're saying.

Let's say there are 8 pipeline stages.  We'll create 8 or more time
slots (cycles), one slot for each simultaneous task assigned to the
shader.  Initially, if a task is stalled waiting on a read, we simply
issue nops during the cycles assigned to that task.  Later, we can
consider adding more tasks so that if some task is stalled we can skip
over it and issue another task's instruction in that cycle.
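
As a rough C sketch of both variants (hypothetical names; it assumes
8 stages, a per-task stall bit, and 12 tasks for the skip-over case,
none of which is settled):

    #include <stdint.h>
    #include <stdbool.h>

    #define N_STAGES 8
    #define N_TASKS  12          /* > N_STAGES once spare tasks exist */

    struct task {
        uint32_t pc;
        bool     stalled;        /* waiting on a read return */
    };

    static struct task tasks[N_TASKS];

    /* Initial scheme: each task owns a fixed slot, and a stalled
     * task's slot is issued as a nop (-1 here). */
    int pick_fixed(unsigned cycle)
    {
        int slot = cycle % N_STAGES;
        return tasks[slot].stalled ? -1 : slot;
    }

    /* Later refinement: if the slot's task is stalled, scan for
     * any other ready task and issue its instruction instead.  A
     * real design would also have to check that the chosen task
     * has no instruction still in flight, so that the no-hazard
     * property is preserved. */
    int pick_skipping(unsigned cycle)
    {
        int start = cycle % N_TASKS;
        for (int i = 0; i < N_TASKS; i++) {
            int t = (start + i) % N_TASKS;
            if (!tasks[t].stalled)
                return t;
        }
        return -1;               /* everything stalled: issue a nop */
    }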


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
