On 7/21/06, Petter Urkedal <[EMAIL PROTECTED]> wrote:
I am considering the efficiency of an interpreter-based controller. From the current thread and the DDR spec, we may want to be able to issue read or write requests on every cycle. This is doable if the 'read' and 'write' instructions halt and wait for new requests rather than exiting. However, it may be difficult to avoid missing a cycle when we need to precharge and the memory is ready for it.
Basically, row misses just hold everything up, and there's nothing we can (or should) do about it, other than designing the GPU to have relatively linear access patterns (which it generally does, except for some texture operations).

I've been thinking about the interpreter-based approach, and I think we can get it to run at speed. On top of what others have suggested, here are some ideas I have about bits that should be in the instruction word:

- next address (encoded directly in the instruction rather than computed)
- read-legal
- write-legal
- precharge-legal
- activate-legal
- memory bus command bits
- selectors for what to put on the address bus

We'll use the block RAM in 32-bit mode so we can have very wide instructions. I can't imagine that 512 instructions wouldn't be enough. Since the BRAMs are dual-ported, we can share one BRAM across two memory controllers.

Conditional branches are external. Basically, either the next address is the one in the instruction, or it's one from a register. We'd make things like hit detection external and, in fact, earlier in a pipeline. Here are some examples of address registers to select from:

- read (hit)
- read (no hit)
- write (hit)
- write (no hit)
- precharge all (for lmr)
- lmr
- refresh

So now we have no counters. There's just a feedback MUX from the instruction and some registers into the block RAM address. Delays are just instructions. However, we might just mux between the "next address" and an address generated from the combination of flags that would otherwise select the register. There may be some redundant addresses, but that's okay.

Since "next" is encoded in the instruction, it's absolutely no problem for the entry points to be grouped and the rest of the routines to be elsewhere. We don't need anything like an lmr-legal flag; if the command is issued, software has ensured that it's safe.
In fact, some of the other flags may be unnecessary, like activate-legal, which may just be part of a "row miss" routine.
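To make the scheme concrete, here is a rough behavioral sketch of the microcode idea in Python. The field names, entry-point addresses, and ROM contents are all illustrative assumptions, not from the actual design; the point is just the feedback mux: either the instruction's encoded next address is used, or an externally selected entry-point register takes over.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    next_addr: int         # next address encoded in the instruction (no counter)
    branch_external: bool  # if set, next address may come from an entry-point register
    read_legal: bool = False
    write_legal: bool = False
    precharge_legal: bool = False
    cmd: str = "NOP"       # memory bus command bits, shown symbolically

# Entry-point registers selected by external hit/miss detection (hypothetical addresses).
ENTRY = {"read_hit": 4, "read_miss": 8, "write_hit": 12, "write_miss": 16}

# Tiny illustrative ROM: an idle loop at 0, a read-hit routine at 4.
rom = {
    0: MicroOp(next_addr=0, branch_external=True),                 # wait for a request
    4: MicroOp(next_addr=0, branch_external=False, cmd="READ",
               read_legal=True),                                   # issue read, return to idle
}

def step(pc, rom, pending):
    """One interpreter cycle: emit this op's command, then pick the next address."""
    op = rom[pc]
    if op.branch_external and pending is not None:
        return ENTRY[pending], op.cmd   # external branch: jump to the selected routine
    return op.next_addr, op.cmd         # otherwise fall through to the encoded address
```

The interpreter never adds to the program counter, so delays are just chains of instructions whose next_addr points at the following word.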
OTOH, it seems very simple to code the time-critical read/write/precharge/activate operations directly in hardware. Initialisation is more complex, but we could still have that in "program" form, where it is simply a sequence of (ddr_signals, delay) pairs: no tests, no jumps. Attached is a sketch which implements read/write only (no init, no refresh) and is not yet debugged. If I haven't missed something essential, this doesn't seem to be complex. Some of the parameters should be put into configurable registers. Also, the interface may be a bit awkward, as the client must hold the input until it receives 'req_received', but in the same cycle it can issue a new request. This can be fixed.
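The branch-free init "program" described above could look something like the following sketch. The command mnemonics and delay values are placeholders for illustration, not real DDR timing parameters, so treat this as a shape, not a spec.

```python
# Init program as pure (ddr_signals, delay) pairs: no tests, no jumps.
# Commands and cycle counts are illustrative placeholders only.
INIT_SEQUENCE = [
    ("NOP",           200),  # settle after power-up
    ("PRECHARGE_ALL",   3),
    ("LOAD_MODE_REG",   2),  # lmr
    ("PRECHARGE_ALL",   3),
    ("REFRESH",        10),
    ("REFRESH",        10),
]

def run_init(seq):
    """Play the sequence back in order; returns the total cycles consumed."""
    cycles = 0
    for cmd, delay in seq:
        # here the hardware would drive `cmd` on the DDR command bus,
        # then hold NOPs for `delay` cycles before advancing
        cycles += delay
    return cycles
```

Because there are no branches, the whole thing reduces to a counter walking a small ROM, which is why init can stay in "program" form even if the hot path is hardwired.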
I looked over your design, and it kicks ass. It's more elegant than some I've done before. The problem you're going to have (that I also had) is that it won't be fast enough. I'm not sure that the block-RAM method would be either, but in order to do something like what you're doing, you've got to pipeline the heck out of it. There's lots of stuff you can do earlier in the pipeline, such as detecting row misses.

One approach I've considered (which suffers from the evil two-level state machine problem) is something like what I describe below. Some of the pipelining might seem excessive, but the idea is to get it to run at 200MHz. First of all, since reads can come from multiple places and there's lots of pipeline latency, read requests are addresses accompanied by a TAG. This tag is used by upstream logic to figure out where the data came from and whom it's going to.

Stage 0: Deal with memory controller internal busy signals. Basically, you mux between the incoming command and an incoming command that was registered on an earlier cycle. This way, the outgoing busy signal is registered (very important).

Stage 1: Look up the last row associated with the selected bank. In that table, replace the row number with the new one. For a read, enqueue the tag.

Stage 2: Compare the row address with the one looked up in the last stage.

Stage 3: Split the command into a 1-hot encoding. Determine if the row is open (might need a precharge) or closed (definitely needs an activate). Determine if there is a row miss, based on the compare from the last stage and whether or not the row is open.

Stage 4: This stage determines whether we're going to do the command issued, or something else. For instance, if there's a row miss, we need to hang onto the read/write command and issue precharge, then activate, then the read/write. This is the first state machine, which can be busy, holding up earlier stages.
This stage may also detect the need for a refresh and insert appropriate commands ahead of anything else that might be coming in.

Stage 5: This is the second state machine. It executes commands generated by the previous stage, issues commands directly to the RAM chips, and manages counters to insert appropriate timing delays. This stage has its own busy signal, complicating things even further.

The mechanics of dealing with CAS latency are handled in other logic, and that logic dequeues the tags when appropriate.

So... I have reason to believe that this would run at 200MHz on a 3S4000. But it would also be a nightmare to debug, unless we got it exactly right on the first try. If we were very careful in how we designed it (not one line of code written until we have the logic perfect), this might just work. The reason I like the block-RAM design is that we might be able to get it to run at speed, and we'd be able to encode all of those inserted commands into just the instruction sequence. We can "waste" lots of instruction entries and still have lots left over.
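The row-tracking part of Stages 1 through 4 can be sketched behaviorally. This is a simplified model under assumed parameters (4 banks, abstract row numbers): a per-bank "last open row" table is looked up and updated, and the comparison classifies the request as a hit, a closed bank, or a row miss, which determines the command sequence the downstream state machine must issue.

```python
NUM_BANKS = 4  # assumption for illustration

class RowTracker:
    """Per-bank table of the last open row, as in Stage 1 of the pipeline sketch."""

    def __init__(self):
        self.open_row = [None] * NUM_BANKS  # None = bank closed (needs activate)

    def classify(self, bank, row):
        """Return the command sequence implied by hit/miss/closed status."""
        last = self.open_row[bank]
        self.open_row[bank] = row            # Stage 1: replace with the new row
        if last is None:
            return ["ACTIVATE", "RW"]        # closed bank: activate, then read/write
        if last == row:
            return ["RW"]                    # row hit: issue the command directly
        return ["PRECHARGE", "ACTIVATE", "RW"]  # row miss: full precharge cycle
```

In hardware the lookup, update, and compare are spread across pipeline stages; collapsing them into one method here is purely for clarity.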
I assume the clock of the memory should be registered, as in the PROM controller. But what about the double data rate? Do we have a clock which is skewed by pi/2, or does someone have black magic to teach? Also, it's not clear to me what terminates a write burst. Is it a negative level of the strobe where there would otherwise be a rising edge?
This is a mechanical thing that's been solved many times. Don't worry about it. Nice work on this, BTW. Thanks!

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
