On 7/18/06, Petter Urkedal <[EMAIL PROTECTED]> wrote:
On 2006-07-18, Timothy Miller wrote:
> In the past, I have arranged memory so that banks divide the memory
> into quarters.  This way, you can read from something in one bank and
> write to a different row in another bank.  So, if you put the display
> buffer at the beginning of memory and pixmaps at the end, you can
> bitblt from pixmaps to screen with a minimum of row misses.

Sounds good, so if the controller just uses the bank select as upper
address lines, the client hardware can take care of efficient
utilisation.

Yes, and we can use various levels of optimization.  If the main
pixmap/texture activity is copying to the screen, then it's simple
enough to allocate pixmaps from high addresses downward.  If we find
other access patterns, we can divide the memory space into four banks
and put things in optimal places.  For instance, when applying two
textures, it would be best to have them in different banks.  In the
extreme, we keep track of statistics as metadata and migrate things
around for maximum performance.  And the great thing is that it's 100%
software.
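To make the idea concrete, here's a minimal sketch of bank-aware
placement, assuming a 4-bank part where the controller uses the two
bank-select bits as the upper address lines.  The address width and
bit positions are illustrative assumptions, not OGD1 specifics:

```python
MEM_BITS = 26                  # 64 MB address space (assumption)
MEM_SIZE = 1 << MEM_BITS
BANK_SHIFT = MEM_BITS - 2      # bank select = top two address bits

def bank_of(addr):
    """Return which of the four banks an address falls in."""
    return (addr >> BANK_SHIFT) & 0x3

# Display buffer at the bottom of memory; pixmaps allocated from the
# top downward, so blits read and write in different banks.
display_buffer = 0
pixmap_top = MEM_SIZE

def alloc_pixmap(size):
    """Allocate a pixmap growing down from the top of memory."""
    global pixmap_top
    pixmap_top -= size
    return pixmap_top

pm = alloc_pixmap(64 * 1024)
assert bank_of(display_buffer) != bank_of(pm)
```

Migrating hot pixmaps between banks based on access statistics would
just be more of the same arithmetic, done by the driver.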

That clears up a few things. I can see that pins connect differently depending
on access width. Are we going to use 16 bit mode? Also, what widths
shall the controller support? I think we can support 16 and 32 bits
almost for free with a programmable controller.

In the abstract, it doesn't matter.  For OGD1, we'll be controlling
pairs of 16-bit chips in lock-step, so it's no different from single
32-bit chips.  All that changes is the size of the data word and the
write-enables.  (Thus, we would make the width a parameter in the
Verilog.)


What follows is a revised instruction set, exploiting a direct mapping
from opcode bits to some of the control signals for most opcodes.

====
opcode[4:1] is roughly (CS_, RAS_, CAS_, WE_)

I forgot about CS.  We basically don't use it, because no two chips
share data lines.  Some value could come of it if we wanted to control
the two 16-bit chips separately, but we might as well control them in
parallel.  Leave that in for now, but if you run out of bits, we can
remove it.  My philosophy wrt DRAMs is point-to-point for max
performance.  Perhaps others would like the scalability of being able
to put more RAMs on the same signal lines, but we're already capable
of addressing 256 MB.  For a non-programmable fragment shader, 256 MB
may be overkill, but we do it for the bus width.  On the other hand,
with programmable shaders, where we have geometry info and shader
programs stored in graphics memory, we would want to throw gigabytes
(potentially) at the GPU.

opcode[0] is one of {A[10], CKE, BA[0]} for some of the commands

opcode     instruction          operands                special
-----------------------------------------------------------------------
1vvvvvvv S jump                 target = opcode[6:0]
01000000 S done
01vvvvvv S if_row_hit           target = opcode[5:0]
001vvvvv S wait                 count = opcode[4:0]; 0 means 32
00010000   dselect
0001001v S set_cke              level = opcode[0]
0000111x   no_operation
0000011x   active               (bank, row, _) = addr
0000101x   read16               (bank, _, col) = addr
00001000   write8               (bank, _, col) = addr   DM=2^addr[0]
00001001   write16              (bank, _, col) = addr   DM=3

I tend to think of the DMs as just part of an extended data word.
Since you're not encoding the data in the instructions, maybe we
shouldn't encode the DMs either.

0000110x   burst_terminate
00000100   precharge_bank       (bank, _, _) = addr     A[10]=0
00000101   precharge_all                                A[10]=1
00000010   self_refresh                                 CKE=0
00000011   auto_refresh                                 CKE=1
00000000 E load_mode_reg        13 bit operand          BA[0]=0
    xxxvvvvv
    vvvvvvvv
00000001 E load_extmode_reg     2 bit operand           BA[0]=1
    xxxvvvvv
    vvvvvvvv

Flags (second column):
    S: software or special instruction, operand bits do not correspond to
       the CS_, RAS_, CAS_, WE_ lines.
    E: 16 bit extended operand loaded from program memory
====
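Just to check the direct mapping: for the non-"S" opcodes, bits [4:1]
drive (CS_, RAS_, CAS_, WE_) and bit [0] drives A[10], CKE, or BA[0],
depending on the command.  A sketch of the decode (function and field
names are illustrative):

```python
def decode(opcode):
    """Split a non-'S' opcode into its control-signal fields."""
    cs_n  = (opcode >> 4) & 1
    ras_n = (opcode >> 3) & 1
    cas_n = (opcode >> 2) & 1
    we_n  = (opcode >> 1) & 1
    aux   = opcode & 1          # A[10] / CKE / BA[0] for some commands
    return cs_n, ras_n, cas_n, we_n, aux

# precharge_all is 00000101: CS_=0, RAS_=0, CAS_=1, WE_=0, A[10]=1
assert decode(0b00000101) == (0, 0, 1, 0, 1)
```

The encodings line up with the standard SDRAM command truth table,
which is the whole point of the direct mapping.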

One possible complication is the operand of load_mode_reg and
load_extmode_reg. It would be convenient to have the operand for those
in program memory, at least for some tasks like initialisation. That
means that the hardware must fetch two consecutive bytes from program
memory and shift them into the address. Maybe not a big deal, just two
extra states.
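In software terms, those two extra states just shift the two
extended-operand bytes (the "xxxvvvvv vvvvvvvv" encoding above) into a
13-bit value.  A sketch, with illustrative names:

```python
def fetch_extended_operand(program, pc):
    """Assemble the 13-bit extended operand from two program bytes."""
    hi = program[pc] & 0x1F      # xxxvvvvv: top 3 bits unused
    lo = program[pc + 1]         # vvvvvvvv
    return (hi << 8) | lo

# load_mode_reg opcode followed by its two operand bytes:
prog = [0b00000000, 0b00010111, 0b00110001]
assert fetch_extended_operand(prog, 1) == 0b1011100110001
```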


This is fascinating and very elaborate.  Probably just what we need.
Unfortunately, I need to get back to working on PCI, so I can't put
enough time into understanding it fully just yet.

Here are a few questions regarding how this design will work:

If we've just done a read/write, what allows another read/write to
happen immediately afterwards?

If we've just done a read/write, what causes a subsequent write/read
on the same row to be delayed by read2write/write2read cycles?

If we've just done a read/write, what causes a row miss to make a
precharge wait read2precharge cycles?

Here's a gotcha:  (activate2read + read2precharge) < activate2precharge
So, if we've just done a read ASAP after an activate, what forces the
greater of activate2precharge and read2precharge cycles of wait?

Here are the timing numbers I think you'll need:

read2write (CL + some constant)
write2read (tWTR + some constant)
activate2rw (tRCD)
write2precharge (tWR + some constant)
read2precharge (CL + some constant)
precharge2activate (tRP)
activate2precharge (tRAS)
refresh2activate (tRFC)

There are a number of other constraints (like activate2activate), but I
think they're less than the sum of any other delays we'd be using in
between, making them irrelevant.

Each of those "some constant"s is a different number, but that doesn't
matter, since it's software's responsibility to compute the right
total delays.
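That computation is a one-time sum per path.  A sketch of how the
driver might derive the wait counts; all numeric values below are
placeholder assumptions, not real datasheet numbers:

```python
import math

CLK_NS = 6.0                      # 166 MHz clock period (assumption)

def cycles(ns):
    """Round a nanosecond parameter up to whole clock cycles."""
    return math.ceil(ns / CLK_NS)

CL = 3                            # CAS latency, cycles (assumption)
tWTR = 2                          # cycles (assumption)
tRCD, tWR, tRP, tRAS, tRFC = 15.0, 15.0, 15.0, 40.0, 70.0  # ns (assumptions)

delays = {
    "read2write":         CL + 1,           # CL + some constant
    "write2read":         tWTR + 1,         # tWTR + some constant
    "activate2rw":        cycles(tRCD),
    "write2precharge":    cycles(tWR) + 1,  # tWR + some constant
    "read2precharge":     CL + 1,           # CL + some constant
    "precharge2activate": cycles(tRP),
    "activate2precharge": cycles(tRAS),
    "refresh2activate":   cycles(tRFC),
}
```

The hardware never needs to know where the numbers came from; it just
counts down whatever the software baked in.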

Somehow, all of those must get encoded as sequences of instructions or
state transitions.

Your SRAMs can be configured as 2kx9, 1kx18, or 512x36.  For
programming convenience, we tend not to use the 'parity' bits, but we
can if necessary.  The point is that you can make your instructions
quite wide.  One thing you could do, instead of having implicit PC
increments, is encode the next-instruction address in the instruction
word.  Of course, for flow control, we'd have to encode two, but I
think we want the flow control to come from outside, based on
instruction flags that indicate which commands are legal at that time.
We multiplex between the "next address" and other entry points based
on the legal commands and their starting addresses.
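A sketch of that multiplexing: each microinstruction carries a set of
legality flags plus a fall-through address, and the sequencer picks
between that address and fixed per-command entry points.  All names
and addresses here are illustrative assumptions:

```python
# Hypothetical entry points for each command's microcode routine.
ENTRY = {"read": 0x10, "write": 0x20, "precharge": 0x30}

def next_pc(instr, request):
    """Pick the next microprogram address.

    instr:   dict with 'legal' (commands legal this cycle) and
             'next' (fall-through address encoded in the word).
    request: pending command from the client, or None.
    """
    if request in instr["legal"]:
        return ENTRY[request]        # jump to that command's routine
    return instr["next"]             # not legal yet: fall through

i = {"legal": {"read"}, "next": 0x05}   # just after a read
assert next_pc(i, "read") == 0x10       # another read goes right in
assert next_pc(i, "write") == 0x05      # write must keep waiting
```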

Say we've just done a read.  The next instruction would indicate that
reads are legal.  But not write or precharge.  Some cycles later,
write becomes legal.  Some cycles later, precharge becomes legal.  If
we issue a command, we lose all state information, but that's okay,
since all timings are reset by the command.

The only problem is activate2precharge, but we could perhaps solve
that by artificially inflating activate2* and/or *2precharge.  That
sucks because we reduce our efficiency, but the idea is to avoid row
misses as much as possible; the more reads and writes you do, the less
the extra delay matters.
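The inflation itself is just a max over the two paths.  With
placeholder cycle counts (assumptions, not datasheet numbers):

```python
activate2read      = 3   # tRCD in cycles (assumption)
read2precharge     = 4   # CL + constant (assumption)
activate2precharge = 9   # tRAS in cycles (assumption)

# Gotcha: 3 + 4 < 9, so an activate followed immediately by a read
# would let precharge become legal too early.  Inflate read2precharge
# so the activate->read->precharge path still covers tRAS:
inflated = max(read2precharge, activate2precharge - activate2read)
assert activate2read + inflated >= activate2precharge
```

As noted, the cost only shows up when a row is abandoned after very
few accesses, which is exactly the case we're trying to make rare.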
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
