On Mon, Oct 12, 2009 at 2:47 PM, Petter Urkedal <[email protected]> wrote:
>> > We should also not completely fix the number of threads per pipeline
>> > before we design it. If we can save one stage, so we have 7 instead
>> > of 8 stages, do we have an argument for rounding up to 8? If not, we
>> > can go with 7 threads and thus fit more pipelines.
>>
>> If we assign N threads to 1 pipeline, we need one icache, one of each
>> functional unit, but N register files. If we allocate 256 registers
>> per thread, then we'll want to assign threads in even numbers, since
>> the BRAMs hold 512 words.
>
> If we end up with an odd number of stages, we might manage to share a
> BRAM between two pipelines if we can use both ports, or if we can
> arrange so that the access is never on the same cycle.

Each thread gets half a RAM for its registers. Say threads A and B are
assigned to BRAM 0. We just have to make sure that whenever A or B has
an instruction in REG, the write-back happening on the same cycle
belongs to some other thread, like G, which is assigned to BRAM 3. It
shouldn't be too hard to size the pipeline so that we never have a read
and a write to the same BRAM on the same cycle. (I sketch a quick check
of this below.)

>> >> > It may be tempting to add one or
>> >> > two extra threads per ALU to keep the ALU busy, but due to the
>> >> > cost and the low frequency of loads, it may be better to send a
>> >> > "phantom" down the ALU for the thread doing the load. The result
>> >> > of the load can be fetched back via a short return queue on each
>> >> > ALU. This could be just one or two slots if we allow stalling in
>> >> > rare cases. As soon as a "phantom" comes out of the ALU, a real
>> >> > thread is dequeued and passed down in place of it.
>> >>
>> >> I'm not sure, but you may be saying the same thing. :)
>> >
>> > No. The load instruction does not prevent the execution of the next
>> > thread. Say thread I is a load and for simplicity we only have 4
>> > stages:
>> >
>> > (  I0,   H0,   G0,   F0)
>> > (  F1, noop,   H0,   G0)
>> > (  G1,   F1, noop,   H0)
>> > (  H1,   G1,   F1, noop)
>> > (noop,   H1,   G1,   F1)  <-- return queue from load is still empty
>> > (  F2, noop,   H1,   G1)  <-- result for thread I ready
>> > (  G2,   F2, noop,   H1)
>> > (  H2,   G2,   F2, noop)
>> > (  I1,   H2,   G2,   F2)  <-- thread I re-scheduled
>> >
>> > It should be noted that if there are several loads in action for a
>> > pipeline, we can plug the first noop which is about to re-enter the
>> > pipeline with the first thread from the pipeline which finishes the
>> > load.
>>
>> Those noops could go on for a long time. The memory latency is
>> unpredictable.
>
> Yes, but unless we are willing to spend an extra thread context, there
> is no way we can utilise the pipeline better than (N - 1)/N while
> there is a pending load in progress.

This is absolutely true. A thread will send nothing (implicit NOOPs)
down the pipeline the whole time it's waiting on read data. That's
unavoidable unless we allow more than N tasks to be assigned to a given
engine. I propose that for now, we just lump it. Later, we can measure
how common it actually is: How much time do tasks spend waiting? How
many additional threads do we have to assign to bring the average wait
to zero? What impact does that have on total throughput across
benchmarks?

There's a tradeoff here. If we assign one more BRAM to a shader, we can
add two more tasks to the rotation. That reduces the total number of
shaders we can fit on a chip, but with little or no impact on the gate
count of the pipeline. (To pin down what I mean, two throwaway Python
sketches follow.)
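First, the BRAM scheduling claim. This is a toy check, not RTL: I'm
assuming 8 threads per pipeline, two threads sharing each BRAM,
round-robin issue, and a fixed distance in cycles between the REG
(read) stage and write-back. The point is just that some pipeline
depths are conflict-free, and we get to pick one of those.

    # Toy check: with N threads issued round-robin and two threads
    # sharing each BRAM, find read-to-write-back distances for which
    # the REG-stage read and the write-back never hit the same BRAM on
    # the same cycle.  N = 8 and the scanned distances are stand-in
    # numbers, not our actual ones.
    N = 8

    def conflict_free(wb_offset, n=N):
        """True if no cycle reads and writes the same BRAM at once."""
        for cycle in range(n):                 # schedule repeats every n cycles
            reading = cycle % n                # thread in the REG (read) stage
            writing = (cycle - wb_offset) % n  # thread in the write-back stage
            if reading // 2 == writing // 2:   # same BRAM (threads 2k, 2k+1 share)
                return False
        return True

    for d in range(1, N):
        status = "ok" if conflict_free(d) else "conflict"
        print(f"write-back {d} cycles after REG: {status}")

With these made-up numbers, distances of 2 through 6 cycles come out
conflict-free; only 1 and 7 (a thread writing back while its BRAM
partner reads) collide. That backs up the "just size the pipeline
right" argument.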
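Second, the rotation itself, to make sure we're describing the same
scheme. Again plain Python, not hardware; the 4 stages, the per-thread
programs, and the fixed 5-cycle memory latency are all made up. A
parked thread's slot carries None, which serves equally well as the
implicit NOOP or as a cleared "valid" flag (more on that choice at the
end of this message):

    # Toy model of the barrel rotation quoted above: N thread contexts
    # share an N-stage pipeline, one issue per cycle.  A thread whose
    # next instruction is a load is parked in the memory unit while a
    # phantom (None) circulates in its slot; when the load-return queue
    # has data, the returned thread plugs the first phantom about to
    # re-enter.  Stage count, programs, and latency are illustrative.
    from collections import deque

    N = 4                        # pipeline stages == thread contexts (assumed)
    MEM_LATENCY = 5              # fixed here for illustration; real latency varies

    programs = {                 # per-thread instruction streams (made up)
        "F": ["alu"] * 8,
        "G": ["alu"] * 8,
        "H": ["alu"] * 8,
        "I": ["alu", "load", "alu", "alu", "alu"],
    }
    pc = {t: 0 for t in programs}            # index of each thread's instruction in flight
    pipeline = deque(["I", "H", "G", "F"])   # stage 0 .. stage N-1; F retires first
    pending = deque()                        # (cycle_data_ready, thread) loads in flight
    return_queue = deque()                   # threads whose load data has come back

    for cycle in range(20):
        # Loads whose (assumed) latency has elapsed land in the return queue.
        while pending and pending[0][0] <= cycle:
            return_queue.append(pending.popleft()[1])

        done = pipeline.pop()                # whatever leaves the last stage
        if done is None:
            # A phantom is about to re-enter: plug it with a returned thread.
            nxt = return_queue.popleft() if return_queue else None
        else:
            pc[done] += 1                    # that thread's instruction completed
            if pc[done] >= len(programs[done]):
                nxt = None                   # program finished; slot goes idle
            elif programs[done][pc[done]] == "load":
                pending.append((cycle + MEM_LATENCY, done))
                pc[done] += 1                # the load is dispatched to memory...
                nxt = None                   # ...and a phantom rides the ALU slot
            else:
                nxt = done                   # thread re-issues its next instruction
        pipeline.appendleft(nxt)

        busy = sum(s is not None for s in pipeline)
        print(f"cycle {cycle:2d}: {[s or '-' for s in pipeline]}  busy {busy}/{N}")

Running it reproduces the shape of the table Petter drew: thread I's
slot circulates empty for two full rotations (3 of 4 stages stay busy,
i.e. (N - 1)/N), and then I plugs the first phantom about to re-enter
once its data is back.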
Back to the tradeoff: if we're constrained by logic, it'll be obvious
that we should add more contexts to the shaders we can fit. If we're
constrained by BRAMs, then we should reduce the number of contexts so
that we can fit more shaders. Once we have a working design, we may be
able to find people willing to help fine-tune it through this kind of
design-space exploration.

>> Besides, this is moot if we dispense with the MIMD.
>> I'm assuming your columns are different threads. Getting rid of MIMD
>> means that we just round-robin N threads on one pipeline. We're not
>> gluing together more than one pipeline.
>
> The above scheme is a way to round-robin N threads on an N-wide
> pipeline with the ability to remove and add back threads. I'm just
> spelling it out since I'm not sufficiently familiar with hardware
> design to know how all the details are worked out. We have probably
> had the same design in mind from the start, and I was just confused
> about "NOOPs get issued [ins: on every N cycles] until the read return
> queue is no longer empty".

Splitting sets is easy: you just detect branch instructions taking
different directions. Recombining them, though, requires a lot more
logic. Besides, I'm becoming more and more convinced that we should
drop MIMD and go with a 1-wide pipeline. (Although I'm open to two
pipelines sharing the same icache, since BRAMs are dual-ported.)

There are two ways to handle the stall while waiting on memory data.
One is to implicitly issue an instruction that does nothing. The other
is to have a "valid" flag that accompanies each instruction. They're
essentially identical in principle, and we should do whichever is
simpler and smaller (probably the former, so I'm going with that until
convinced otherwise).

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
