On 9/8/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> On 2007-09-06, Timothy Normand Miller wrote:
> > On 9/6/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> > > > I'm just worried about the race conditions.  In fact, I know they'll
> > > > be a problem.  We really don't want to give the nanocontroller a
> > > > separate pipe to memory.  So if we have pending reads, then data will
> > > > come back out of order.
> > >
> > > I thought we could solve that by encoding the thread number in the
> > > request and reply.  An attempt to read data which the thread does not
> > > own would cause a context switch.
> >
> > I'm trying to imagine this, and I can't see a solution that wouldn't
> > be more complicated than what I'm proposing.  Could you go into some
> > detail about the assumptions you're making about the memory system?
> > How is it to know whether a read word was requested by one thread or
> > the other?  And does this mean that thread switching is completely
> > automatic?  What other conditions would cause a thread switch?
>
> (Note that this is mostly of academic interest, since we'll try to
> avoid context switching if possible.  The nanocontroller is a simple
> design.)
>
> Let's focus on the tricky part, the reads.  My assumption is that the
> nanocontroller can write requests to the sink end of a FIFO and read
> from the source end of another FIFO.  The threading extension involves
> tagging each request with a thread number, which will be propagated
> back to the other FIFO.
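To make the proposal concrete, here's a minimal sketch (Python, purely illustrative; all names are hypothetical, not the actual hardware interfaces) of the tagged request/reply scheme: each read request carries the issuing thread's number, the memory model echoes the tag back with the data, and a thread that finds someone else's reply at the head of the source FIFO gets nothing back (which is where a context switch would happen).

```python
from collections import deque

class TaggedMemory:
    """Toy model of the proposal: requests enter a sink FIFO, and
    replies come back in order on a source FIFO, each carrying the
    thread tag of its request (hypothetical names throughout)."""

    def __init__(self, ram):
        self.ram = ram
        self.requests = deque()   # sink end: (thread, addr)
        self.replies = deque()    # source end: (thread, data)

    def issue_read(self, thread, addr):
        # The nanocontroller tags each request with its thread number.
        self.requests.append((thread, addr))

    def service_one(self):
        # The memory side propagates the thread tag back with the data.
        if self.requests:
            thread, addr = self.requests.popleft()
            self.replies.append((thread, self.ram[addr]))

    def try_pop(self, thread):
        """Return data if the head reply belongs to `thread`, else
        None -- the caller would context-switch instead of blocking."""
        if self.replies and self.replies[0][0] == thread:
            return self.replies.popleft()[1]
        return None
```

Since replies return strictly in request order, a thread trying to read data it does not own simply sees a foreign tag at the head and yields, rather than consuming the other thread's data out of order.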
Yeah, but that path is long and complicated.  It's true that the memory
controller has a "read tag" queue in it.  This way, we can interleave
engine, PCI, and video reads and have the data return to the right
places.  But there's a lot of logic between the nanocontroller and the
memory controller.

> First, a simplified suboptimal solution without continuous
> thread-switching:  When a thread requests reading from the source, one
> of two things can happen:
>
> * The source is nonempty and the next data belongs to the current
>   thread.
>
> * The source is empty or the next data belongs to another thread.
>
> In the former case, the instruction is executed.  In the latter case,
> the read instruction is propagated down the pipeline as a noop, the PC
> is reset to point to the same read instruction, and a context switch
> happens.  Another case that could trigger a context switch would be
> interrupts.

We already have a branch-delay architecture.  We do this so we don't
have to have any flow control in the pipeline.  By the time this read
gets down to the MEMIO stage, a few other instructions have already
been fetched and partially processed.  We would have to flush the
pipeline, ensuring that those instructions have no effects.  What if
one of those following instructions is itself a branch?  What then?  If
we're going to do this, we might as well have a design with flow
control, exceptions, interrupts, and the like.  Think about the
complexity necessary to do this.

> However, if we can multi-task the most critical code, we should try to
> guarantee that
>
> I1. The same thread never runs in two consecutive cycles.
>
> That way we
> * get rid of the delay slot,
> * can rip out half of the register forwarding, and
> * can pipeline the adder over two stages without imposing
>   data-dependency constraints on the machine code.

Those are all nice benefits.  I'm just wondering whether a half-speed
thread will keep up with some of the chores.
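For what invariant I1 implies, here's a tiny sketch (Python, hypothetical names; a model of the scheduling rule, not a hardware design): a selector that refuses to run the thread that ran last cycle, inserting a noop cycle when no other thread is runnable.

```python
def schedule(prev, runnable):
    """Pick a thread for this cycle under invariant I1: the same
    thread never runs in two consecutive cycles.
    prev     -- thread run last cycle, or None
    runnable -- set of thread ids able to issue an instruction
    Returns a thread id, or None for an inserted noop cycle."""
    candidates = sorted(t for t in runnable if t != prev)
    return candidates[0] if candidates else None

# With two contexts, I1 forces strict alternation while both are
# runnable -- and a noop whenever only the previous thread is runnable.
trace = []
prev = None
for _ in range(4):
    prev = schedule(prev, {0, 1})
    trace.append(prev)
# trace is [0, 1, 0, 1]
```

This also shows why I1 buys the listed benefits: because a thread's next instruction is always at least two cycles behind its previous one, its results are written back before they are needed, so the delay slot and much of the forwarding become unnecessary.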
> How would we implement a smart scheduler which runs threads
> alternately and is predictive enough about reads that it causes the
> minimum number of noops to be inserted?

Yeah... you're ahead of me here. :)  I should read whole posts before
responding. :)

> First, with only 2 contexts, I1 predetermines that the two threads
> must be run alternately.  That could easily mean that half of the
> cycles will process inserted noops when both threads try to read.
> Therefore, with 2 threads we should abandon I1, or use a smarter
> read-data FIFO which splits the replies into two sources, one for
> each thread.

We already have a proliferation of FIFOs.  I'm already suggesting too
many.  Adding more.... ugh.

> With 4 contexts the scheduling gets more interesting.  One thing we
> can decide right away is that
>
> S1. If there is data in the read-reply FIFO and the owner was not
>     run on the previous cycle, then we run the owner.
>
> Otherwise, we know that any thread which tries to read cannot proceed.
> Further heuristics are difficult, since we don't know whether the next
> instruction for a thread is a read before we fetch it.  The easiest
> solution is probably to make an attempt; if we fail, we set a
> "pending-read" bit for that thread, and we don't re-schedule it before
> S1 applies.  A more predictive heuristic is to keep track of how many

See, all of this makes me think about things like branch prediction and
the like.  Branch prediction, of course, requires that we be able to
flush the pipeline and restart.

> pending read-replies each thread has, and run the thread with the
> fewest pending replies, the argument being that it is the least likely
> to block.  It seems that with 4 contexts we could also benefit from
> routing read-replies to separate sources for each thread.

Ok, this is a good predictive mechanism.
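Putting the pieces of the 4-context proposal together -- rule S1, the pending-read bit, and the fewest-pending-replies tiebreak -- here's one way it could look (a Python sketch under the assumptions above; function and parameter names are invented for illustration):

```python
def pick_thread(prev, reply_head_owner, pending_read, pending_replies):
    """4-context scheduler sketch.
    prev             -- thread run last cycle, or None
    reply_head_owner -- owner of the data at the head of the
                        read-reply FIFO, or None if it is empty
    pending_read     -- set of threads stalled with the pending-read
                        bit set (not re-schedulable until S1 applies)
    pending_replies  -- dict: thread -> outstanding read replies
    Returns a thread id, or None for an inserted noop cycle."""
    # S1: if reply data is waiting and its owner did not run on the
    # previous cycle, run the owner.
    if reply_head_owner is not None and reply_head_owner != prev:
        return reply_head_owner
    # Otherwise, among threads that satisfy I1 and are not stalled on a
    # pending read, prefer the one with the fewest outstanding replies,
    # on the argument that it is the least likely to block.
    candidates = [t for t in range(4)
                  if t != prev and t not in pending_read]
    if not candidates:
        return None  # nothing runnable: insert a noop cycle
    return min(candidates, key=lambda t: pending_replies.get(t, 0))
```

The appeal of S1 in particular is that it is cheap to evaluate in hardware: it only inspects the head of the read-reply FIFO and the previous cycle's thread id.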
> If we have more than one async source to deal with, scheduling
> decisions become more complex; however, in such a case we could
> probably exploit the independence of the resources, so that only
> selected threads have access to each resource.

I just don't like the idea of having any sort of instruction scheduling
(besides the most trivial) in a design that we're trying to keep as
small as possible.

> > > Given these conclusions, we are close to the desired
> > > nanocontroller:  First, let's `ifdef out the multiplier logic.
> > > Then, maybe we turn IO access into registers.  If not, a minor
> > > practical-aesthetic point is that I'd suggest negative addresses
> > > for IO ports, because it lets us expand the scratch memory without
> > > changing the IO base address.  Then we can try to synthesise it
> > > again.
> >
> > Minor nit-pick.  I'd suggest choosing a high address bit (but
> > something within the range of our immediates) to specify the bottom
> > of I/O port space.  16384, I guess, from 15-bit unsigned immediates.
> > The reason I don't like negative addresses is that they introduce
> > additional math that I don't want to deal with, even if that doesn't
> > translate into any real hardware.  Note that if the immediate gets
> > sign-extended, it doesn't make any difference.  We just ignore the
> > upper bits anyhow, and we treat -16384 as the offset for I/O ports.
> > So we use bit 15 as the "is it scratch or I/O" flag.
>
> In the current version at least, immediates are always sign-extended,
> and I don't see a reason to change that.  So the two schemes are
> equivalent as seen from any immediate-encodable address.  However, if
> we were to extend scratch memory beyond the range of immediates, then
> the I/O area will shadow the range just above the highest positive
> immediate address, whereas if we use bit 31 as the "scratch or I/O"
> flag then I/O space will always be out of the way.

Fair enough.
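The equivalence being discussed can be checked in a few lines.  This is a sketch only, assuming the flag is the address bit with value 16384 and that upper bits are ignored as described; helper names are made up for illustration:

```python
IO_FLAG = 16384  # assumed "scratch or I/O" flag bit (value 2**14)

def is_io(addr):
    """True if the address falls in I/O port space.  Python's bitwise
    AND on negative ints uses two's complement, so sign-extended
    negative offsets test the same flag bit as unsigned addresses."""
    return bool(addr & IO_FLAG)

def io_port(addr):
    """Port number relative to the I/O base, ignoring upper bits."""
    return addr & (IO_FLAG - 1)

# Unsigned base 16384 and sign-extended offset -16384 agree: the same
# flag bit is set, and the same port number falls out.
assert is_io(16384 + 5) and is_io(-16384 + 5)
assert io_port(16384 + 5) == io_port(-16384 + 5) == 5
```

This is exactly the "it doesn't make any difference" point: within immediate-encodable range, the two addressing schemes decode identically; they only diverge if scratch memory grows past the immediate range.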
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
