On 5/28/05, Patrick McNamara <[EMAIL PROTECTED]> wrote:
> Eric Smith wrote:
> 
> >Patrick wrote:
> >
> >
> >>Ok, here are some assumptions I made.
> >>
> >>Load/store architecture
> >>unified instruction, data, and registers.  In other words, the 512 memory
> >>locations contain code, data, and registers.
> >>
> >>
> >
> >
> >
> >>Looking at the instruction count, I think we can certainly use one of
> >>the FPGA 512x36 RAM blocks for the nanocontroller.
> >>
> >>
> >
> >The XC3S1500 has 32 of the 18Kbit BlockRAMs, and the XC3S4000 has
> >96 of them, so it's probably reasonable to allocate several to the
> >nanocontroller to provide flexibility.  And after all, it's an FPGA,
> >so tweaking the number of BlockRAMs assigned to the nanocontroller
> >should only be a matter of changing a few lines of RTL.
> >
> >The block RAMs have only two ports, so you can't use a single one
> >for code, data, and registers.
> >
> >For a load/store architecture (that doesn't do both simultaneously),
> >you might be able to share one block RAM between instructions and
> >data.  But if pipelining requires that data written by store
> >instruction n has to be written at the same time as data read by a
> >load instruction n+1, then a separate block RAM is needed for data
> >(or a stall/pipeline bubble).
> >
> >
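To make that hazard concrete, here's a toy model of the port demand (my
sketch, not real RTL -- the opcode names and the one-cycle write-back
timing are hypothetical).  A 2-port block RAM shared by instruction fetch
and data access runs out of ports exactly when a store's write-back lands
in the same cycle as the next instruction's load:

```python
# Toy model of port contention on a 2-port block RAM shared by
# instruction fetch and data access.  Hypothetical timing: a store's
# write happens during the following instruction's cycle.

def cycles_with_stalls(program, ports=2):
    """Count cycles to retire `program`, adding a stall cycle whenever
    the per-cycle port demand (fetch + data accesses) exceeds `ports`."""
    cycles = 0
    pending_write = False  # store from the previous instruction
    for op in program:
        demand = 1  # instruction fetch always needs a port
        if pending_write:
            demand += 1  # previous store's write-back
        if op in ("load", "store"):
            demand += 1  # this instruction's data access
        # each port needed beyond what's available costs a stall cycle
        cycles += 1 + max(0, demand - ports)
        pending_write = (op == "store")
    return cycles

# A store immediately followed by a load needs 3 ports in one cycle,
# so the shared-RAM design stalls once:
print(cycles_with_stalls(["store", "load"]))  # 3 cycles instead of 2
```

With a separate data RAM (effectively a third port) the same sequence
retires in 2 cycles, which is the stall/bubble trade-off described above.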
> So we drop the pipelining.  As Timothy has pointed out, it doesn't have
> to be fast.  Rather than try to pipeline the nanocontroller (which will
> be constantly stalled waiting on card memory anyway), let's go the other
> way.  Assume 1 instruction every 5 clocks for the nanocontroller.  That
> should give enough stages to allow for a single read or write per clock
> cycle.
> 
> Now for the math.  Timothy said to expect a 20 clock delay for random
> access to card memory.  I'm going to assume that is 20 clocks in the
> 200 MHz domain.

No.  I'm assuming a large part of that delay comes from synchronizing
FIFOs between clock domains.  These are not a problem when the GPU is
streaming, but if you want atomic reads, they suck.  And there's very
little you can do about it.  Furthermore, I'm assuming a row miss in
the memory controller, and numerous other delays.  Everything's
designed for throughput, not latency, and I decided on a rough guess
of 20 cycles in the 100 MHz domain.  It might actually be worse!

> This means the controller would have to stall for 10
> clocks for each external memory access.  At 149,000 memory accesses per
> screen update, that gives us 1.49M clock cycles for memory access.  For
> the program there are 62 instructions.  One set of 13 is looped 64
> times (the blit of the character bitmap).  This gives us 881
> instructions per character output times 2000 characters, or 1.76M
> instructions per screen update.  At 5 clocks per instruction we get
> 8.81M clocks per screen update for the instructions.  A grand total of
> 10.3M clocks per screen update, or just under 10 Hz at a 100 MHz
> controller clock.

As long as it's better than 5 Hz, we should be alright.  But I suspect
we can do better than 10 Hz if we do things just right.  Again, we make
reads non-atomic, where the request and receipt are separate.  That
will help A LOT.  The only way to do better would be to design
dedicated hardware.
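For what it's worth, the clock budget above checks out in a few lines
(using Patrick's assumed figures -- 149,000 accesses, a 10-clock stall
each, 881 instructions per character, 5 clocks per instruction -- none
of these are measurements):

```python
# Re-deriving the refresh-rate estimate from the assumptions above.
mem_accesses   = 149_000         # card-memory accesses per screen update
stall_clocks   = 10              # controller clocks lost per access
instructions   = 881 * 2000      # 881 per character x 2000 characters
clocks_per_ins = 5               # assumed 1 instruction every 5 clocks
f_clk          = 100e6           # 100 MHz controller clock

memory_clocks  = mem_accesses * stall_clocks    # 1.49M
compute_clocks = instructions * clocks_per_ins  # 8.81M
total = memory_clocks + compute_clocks          # 10.3M
print(f"refresh = {f_clk / total:.1f} Hz")      # prints 9.7 Hz
```

So "just under 10 Hz" is right, before any of the non-atomic-read
savings.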

> 
> Of course that cuts in half on an 80x50 screen...  Maybe we do need the
> pipelining...  At 1.2 clocks per instruction (the .2 I tossed in for
> pipeline stalls and flushes), an 80x25 screen refresh rate of slightly
> over 27 Hz is possible.
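Checking that pipelined estimate quickly -- Patrick's 1.76M
instructions and 1.49M memory-stall clocks, with the 1.2 read as clocks
per instruction (his figures, my arithmetic):

```python
# Pipelined estimate: same instruction and memory-stall counts as the
# 5-clock case, but ~1.2 clocks per instruction.
instructions  = 1_762_000   # 881 x 2000
cpi           = 1.2         # the .2 covers pipeline stalls and flushes
memory_clocks = 1_490_000   # 149,000 accesses x 10 stall clocks
f_clk         = 100e6       # 100 MHz controller clock

total = instructions * cpi + memory_clocks      # ~3.6M clocks
print(f"refresh = {f_clk / total:.1f} Hz")      # prints 27.7 Hz
```

The numbers do land slightly over 27 Hz, so the memory stalls, not the
instruction rate, become the dominant cost once you pipeline.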
> 
> Food for thought.
> 
> Patrick M
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>
