On 7/21/06, Petter Urkedal <[EMAIL PROTECTED]> wrote:
I am considering the efficiency of an interpreter-based controller. From the current thread and the DDR spec, we may want to be able to issue read or write requests on every cycle. This is doable if the 'read' and 'write' instructions halt and wait for new requests rather than exiting. However, it may be difficult to avoid missing a cycle when we need to precharge and the memory is ready for it.
Basically, row misses just hold everything up, and there's nothing we can (or should) do about it, other than designing the GPU to have relatively linear access patterns (which it generally does, except for some texture operations).

I've been thinking about the interpreter-based approach, and I think we can get it to run at speed. On top of what others have suggested, here are some ideas I have about bits that should be in the instruction word:

- next address (encoded directly in the instruction rather than computed)
- read-legal
- write-legal
- precharge-legal
- activate-legal
- memory bus command bits
- selectors for what to put on the address bus

We'll use the block RAM in 32-bit mode so we can have very wide instructions. I can't imagine that 512 instructions wouldn't be enough. Since the BRAMs are dual-ported, we can share one BRAM across two memory controllers.

Conditional branches are external. Basically, either the next address is the one in the instruction, or it's one from a register. We'd make things like hit detection external and, in fact, earlier in a pipeline. Here are some examples of address registers to select from:

- read (hit)
- read (no hit)
- write (hit)
- write (no hit)
- precharge all (for lmr)
- lmr
- refresh

So now we have no counters. There's just a feedback MUX from the instruction and some registers into the block RAM address. Delays are just instructions. However, we might just mux between the "next address" and an address generated from the combination of flags that would otherwise select the register. There may be some redundant addresses, but that's okay.

Since "next" is encoded in the instruction, it's absolutely no problem for the entry points to be grouped and the rest of the routines to be elsewhere. We don't need anything like an lmr-legal flag; if the command is issued, software has ensured that it's safe.
In fact, some of the other flags may be unnecessary, like activate-legal, which may just be part of a "row miss" routine.
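To make the scheme concrete, here is a rough behavioral sketch of the microcode idea in Python. The field names, entry-point addresses, and ROM contents are all illustrative assumptions, not from the actual design; the point is just the feedback mux: either the instruction's encoded next address is used, or an externally selected entry-point register takes over.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    next_addr: int         # next address encoded in the instruction (no counter)
    branch_external: bool  # if set, next address may come from an entry-point register
    read_legal: bool = False
    write_legal: bool = False
    precharge_legal: bool = False
    cmd: str = "NOP"       # memory bus command bits, shown symbolically

# Entry-point registers selected by external hit/miss detection (hypothetical addresses).
ENTRY = {"read_hit": 4, "read_miss": 8, "write_hit": 12, "write_miss": 16}

# Tiny illustrative ROM: an idle loop at 0, a read-hit routine at 4.
rom = {
    0: MicroOp(next_addr=0, branch_external=True),                 # wait for a request
    4: MicroOp(next_addr=0, branch_external=False, cmd="READ",
               read_legal=True),                                   # issue read, return to idle
}

def step(pc, rom, pending):
    """One interpreter cycle: emit this op's command, then pick the next address."""
    op = rom[pc]
    if op.branch_external and pending is not None:
        return ENTRY[pending], op.cmd   # external branch: jump to the selected routine
    return op.next_addr, op.cmd         # otherwise fall through to the encoded address
```

The interpreter never adds to the program counter, so delays are just chains of instructions whose next_addr points at the following word.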
OTOH, it seems very simple to code the time-critical read/write/precharge/activate operations directly in hardware. Initialisation is more complex, but we could still have that in "program" form, where it is simply a sequence of (ddr_signals, delay) pairs: no tests, no jumps. Attached is a sketch which implements read/write only (no init, no refresh) and is not yet debugged. If I haven't missed something essential, this doesn't seem to be complex. Some of the parameters should be put into configurable registers. Also, the interface may be a bit awkward, as the client must hold the input until it receives 'req_received', but in the same cycle it can issue a new request. This can be fixed.
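The branch-free init "program" described above could look something like the following sketch. The command mnemonics and delay values are placeholders for illustration, not real DDR timing parameters, so treat this as a shape, not a spec.

```python
# Init program as pure (ddr_signals, delay) pairs: no tests, no jumps.
# Commands and cycle counts are illustrative placeholders only.
INIT_SEQUENCE = [
    ("NOP",           200),  # settle after power-up
    ("PRECHARGE_ALL",   3),
    ("LOAD_MODE_REG",   2),  # lmr
    ("PRECHARGE_ALL",   3),
    ("REFRESH",        10),
    ("REFRESH",        10),
]

def run_init(seq):
    """Play the sequence back in order; returns the total cycles consumed."""
    cycles = 0
    for cmd, delay in seq:
        # here the hardware would drive `cmd` on the DDR command bus,
        # then hold NOPs for `delay` cycles before advancing
        cycles += delay
    return cycles
```

Because there are no branches, the whole thing reduces to a counter walking a small ROM, which is why init can stay in "program" form even if the hot path is hardwired.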
I looked over your design, and it kicks ass. It's more elegant than some I've done before. The problem you're going to have (that I also had) is that it won't be fast enough. I'm not sure that the block-RAM method would be either, but in order to do something like what you're doing, you've got to pipeline the heck out of it. There's lots of stuff you can do earlier in the pipeline, such as detecting row misses.

One approach I've considered (which suffers from the evil two-level state machine problem) is something like what I describe below. Some of the pipelining might seem excessive, but the idea is to get it to run at 200MHz. First of all, since reads can come from multiple places and there's lots of pipeline latency, read requests are addresses accompanied by a TAG. This tag is used by upstream logic to figure out where the data came from and whom it's going to.

Stage 0: Deal with memory controller internal busy signals. Basically, you mux between the incoming command and an incoming command that was registered on an earlier cycle. This way, the outgoing busy signal is registered (very important).

Stage 1: Look up the last row associated with the selected bank. In that table, replace the row number with the new one. For a read, enqueue the tag.

Stage 2: Compare the row address with the one looked up in the last stage.

Stage 3: Split the command into a 1-hot encoding. Determine if the row is open (might need a precharge) or closed (definitely needs an activate). Determine if there is a row miss, based on the compare from the last stage and whether or not the row is open.

Stage 4: This stage determines whether we're going to do the command issued, or something else. For instance, if there's a row miss, we need to hang onto the read/write command and issue precharge, then activate, then the read/write. This is the first state machine, which can be busy, holding up earlier stages.
This stage may also detect the need for a refresh and insert appropriate commands ahead of anything else that might be coming in.

Stage 5: This is the second state machine. It executes commands generated by the previous stage, issues commands directly to the RAM chips, and manages counters to insert appropriate timing delays. This stage has its own busy signal, complicating things even further.

The mechanics of dealing with CAS latency are handled in other logic, and that logic dequeues the tags when appropriate.

So... I have reason to believe that this would run at 200MHz on a 3S4000. But it would also be a nightmare to debug, unless we got it exactly right on the first try. If we were very careful in how we designed it (not one line of code written until we have the logic perfect), this might just work. The reason I like the block-RAM design is that we might be able to get it to run at speed, and we'd be able to encode all of those inserted commands into just the instruction sequence. We can "waste" lots of instruction entries and still have lots left over.
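The row-tracking part of Stages 1 through 4 can be sketched behaviorally. This is a simplified model under assumed parameters (4 banks, abstract row numbers): a per-bank "last open row" table is looked up and updated, and the comparison classifies the request as a hit, a closed bank, or a row miss, which determines the command sequence the downstream state machine must issue.

```python
NUM_BANKS = 4  # assumption for illustration

class RowTracker:
    """Per-bank table of the last open row, as in Stage 1 of the pipeline sketch."""

    def __init__(self):
        self.open_row = [None] * NUM_BANKS  # None = bank closed (needs activate)

    def classify(self, bank, row):
        """Return the command sequence implied by hit/miss/closed status."""
        last = self.open_row[bank]
        self.open_row[bank] = row            # Stage 1: replace with the new row
        if last is None:
            return ["ACTIVATE", "RW"]        # closed bank: activate, then read/write
        if last == row:
            return ["RW"]                    # row hit: issue the command directly
        return ["PRECHARGE", "ACTIVATE", "RW"]  # row miss: full precharge cycle
```

In hardware the lookup, update, and compare are spread across pipeline stages; collapsing them into one method here is purely for clarity.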
I assume the clock of the memory should be registered, as in the PROM controller. But what about the double data rate? Do we have a clock which is skewed by pi/2, or does someone have black magic to teach? Also, it's not clear to me what terminates a write burst. Is it a negative level of the strobe where there would otherwise be a rising edge?
This is a mechanical thing that's been solved many times. Don't worry about it. Nice work on this, BTW. Thanks!

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
