On 5/28/05, Patrick McNamara <[EMAIL PROTECTED]> wrote:
> Eric Smith wrote:
>
> >Patrick wrote:
> >
> >>Ok, here are some assumptions I made:
> >>
> >>Load/store architecture.
> >>Unified instructions, data, and registers. In other words, the 512
> >>memory locations contain code, data, and registers.
> >>
> >>Looking at the instruction count, I think we can certainly use one of
> >>the FPGA's 512x36 RAM blocks for the nanocontroller.
> >
> >The XC3S1500 has 32 of the 18 Kbit BlockRAMs, and the XC3S4000 has
> >96 of them, so it's probably reasonable to allocate several to the
> >nanocontroller to provide flexibility. And after all, it's an FPGA,
> >so tweaking the number of BlockRAMs assigned to the nanocontroller
> >should only be a matter of changing a few lines of RTL.
> >
> >The block RAMs have only two ports, so you can't use a single one
> >for code, data, and registers.
> >
> >For a load/store architecture (that doesn't do both simultaneously),
> >you might be able to share one block RAM between instructions and
> >data. But if pipelining requires that data written by store
> >instruction n be written at the same time as data read by load
> >instruction n+1, then a separate block RAM is needed for data
> >(or a stall/pipeline bubble).
>
> So we drop the pipelining. As Timothy has pointed out, it doesn't have
> to be fast. Rather than try to pipeline the nanocontroller (which will
> be constantly stalled waiting on card memory anyway), let's go the other
> way. Assume one instruction every 5 clocks for the nanocontroller. That
> should give enough stages to allow for a single read or write per clock
> cycle.
>
> Now for the math. Timothy said to expect a 20-clock delay for random
> access to card memory. I'm going to assume that is 20 clocks in the
> 200 MHz domain.
No. I'm assuming a large part of that delay comes from the synchronizing
FIFOs between clock domains. These are not a problem when the GPU is
streaming, but if you want atomic reads, they suck, and there's very
little you can do about it. I'm also assuming a row miss in the memory
controller, and numerous other delays. Everything's designed for
throughput, not latency, so I settled on a rough guess of 20 cycles in
the 100 MHz domain. It might actually be worse!

> This means the controller would have to stall for 10
> clocks for each external memory access. At 149,000 memory accesses per
> screen update, that gives us 1.49M clock cycles for memory access. For
> the program there are 62 instructions. One set of 13 is looped 64
> times (the blit of the character bitmap). This gives us 881
> instructions per character output, times 2000 characters, or 1.76M
> instructions per screen update. At 5 clocks per instruction we get
> 8.81M clocks per screen update for the instructions. A grand total of
> 10.3M clocks per screen update, or just under 10 Hz with a 100 MHz
> controller clock.

As long as it's better than 5 Hz, we should be all right. But I suspect
we can do better than 10 if we do things just right. Again, we make
reads non-atomic, so that the request and the receipt are separate
operations. That will help A LOT. The only way to do better would be to
design dedicated hardware.

> Of course that cuts in half on an 80x50 screen... Maybe we do need the
> pipelining... At 1.2 clocks per instruction (the .2 I tossed in for
> pipeline stalls and flushes) and an 80x25 screen, a refresh rate of
> slightly over 27 Hz is possible.
>
> Food for thought.
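For anyone wanting to check Patrick's arithmetic, here is a small Python sketch of the clock budget. It uses the figures from the thread as given (10 stall clocks per access, i.e. Patrick's 20-clocks-at-200-MHz assumption, not my 20-clocks-at-100-MHz guess; nothing here is measured):

```python
# Sanity check of the screen-update clock budget from the thread.
CLOCK_HZ = 100_000_000      # assumed 100 MHz nanocontroller clock
STALL_CLOCKS = 10           # 20 clocks @ 200 MHz ~= 10 clocks @ 100 MHz
MEM_ACCESSES = 149_000      # external memory accesses per screen update
CLOCKS_PER_INSN = 5         # unpipelined nanocontroller

# 62 instructions total; a 13-instruction blit loop runs 64 times.
insns_per_char = (62 - 13) + 13 * 64            # = 881
chars = 2000                                    # 80x25 text screen
insn_clocks = insns_per_char * chars * CLOCKS_PER_INSN   # ~8.81M
mem_clocks = MEM_ACCESSES * STALL_CLOCKS                 # 1.49M

total = insn_clocks + mem_clocks                # ~10.3M clocks
refresh_hz = CLOCK_HZ / total
# With my 20-clocks-@-100-MHz guess instead, mem_clocks doubles and
# the rate drops to roughly 8.5 Hz.
print(f"{total/1e6:.2f}M clocks -> {refresh_hz:.1f} Hz")
# prints: 10.30M clocks -> 9.7 Hz
```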
> Patrick M
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
