On 5/26/05, Viktor Pracht <[EMAIL PROTECTED]> wrote:
> On Wednesday, 25.05.2005, at 20:32 -0400, Timothy Miller wrote:
>
> > I've been thinking about it, and while I really like the idea of
> > instructions being lookup tables in RAM, it may not give us the
> > performance we need. Things will already be slow. So, I suggest we
> > develop a simple processor and use an FPGA RAM block to store both
> > nearly 500 instructions and the register file.
>
> That "may not give" is not enough. I want real numbers to make it either
> "will give" or "won't give".
>
> The performance of the nanocontroller is adequate in all cases except
> where a single VGA operation potentially affects the whole framebuffer
> (changing the palette, changing the font, etc.), or in text mode, where
> a single write changes up to 1 KB of framebuffer but is expected to be
> very fast. These cases are simply a lot of memory copying, with an
> additional memory read in between (to perform computations on the data).
> That becomes six cached instructions, a couple of cached LUTs, and two
> parallel, very predictive access patterns.
>
> Since the 3D pipeline is supposed to be able to redraw the whole screen
> at much higher resolutions and framerates than VGA, the memory bandwidth
> can't be the bottleneck. The question now is, what does the cache look
> like, and how can the nanocontroller be designed to use it optimally?
> The ideal case is indeed when the nanocontroller code is inside an FPGA
> RAM block, but it's best when that block is part of the normal cache and
> isn't wasted in non-VGA mode. (And that's true for any kind of VGA
> processor.)
>
> PS: Don't worry about the idea of custom instructions. It's nothing
> more than a memory read with indirect addressing. Any processor that is
> capable of looking up colors in the DAC palette is automatically capable
> of that.
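Viktor's PS can be made concrete: a table-driven "custom instruction" is just one memory read with indirect addressing, the same operation as a DAC palette lookup. Here is a minimal sketch in Python; the names (memory, lut_base) and the example table (a 4-bit bit-reverse) are purely illustrative, not taken from the actual design:

```python
# Model: a flat word-addressed memory holding both data and lookup tables.
# Names and layout are hypothetical, for illustration only.
memory = [0] * 1024
lut_base = 256  # where some lookup table happens to live

# Install a table for a made-up "reverse 4-bit value" instruction.
for i in range(16):
    memory[lut_base + i] = int(f"{i:04b}"[::-1], 2)

def custom_op(operand):
    """Execute the table-driven instruction: a single indirect read."""
    return memory[lut_base + operand]

print(custom_op(0b0001))  # -> 8, i.e. 0001 reversed is 1000
```

The point of the sketch is that the datapath needs nothing new: any processor that can compute base + index and read memory can execute any instruction defined this way.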
Ok, the way reads work in this memory controller, it's designed for throughput and not latency. So, streaming reads will be efficient, but atomic reads will have a latency of AT LEAST 20 clock cycles. In the 3D pipeline, the places where this matters have FIFOs to absorb the latency. But in the nanocontroller, it's mostly atomic, and there's very little you can do to absorb the latency. Also, since the memory controller and the nanoprocessor run at different clock rates, there's additional latency in the cross-domain synchronization. So, I figure you'll have a delay of roughly 20 cycles in the processor's 100 MHz domain for ANY memory read.

While instructions can be cached, to an extent, the lookup tables cannot be, because the accesses are totally random. If we pipeline it properly, that's 20 cycles per instruction, unless the instruction indicates another memory read, in which case it's another 20. (Writes can be ignored.)

Now, imagine the sort of program that has to be written to convert an 80x25 text mode to graphics. There are loops and lots of memory reads and all sorts of stuff. The throughput's going to be horrible. If we can do some estimates on instruction count, we can come up with a framerate.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
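A back-of-the-envelope version of the framerate estimate Tim asks for, assuming his figures (20-cycle stall per uncached read, 100 MHz processor domain). The per-cell read count and loop overhead below are placeholder guesses, not measurements from any real nanocontroller program:

```python
# Rough framerate estimate for converting 80x25 text mode to graphics.
# The 20-cycle read latency and 100 MHz clock come from the discussion;
# everything else is an assumed instruction-count guess.
CLOCK_HZ = 100_000_000
CYCLES_PER_READ = 20

cols, rows = 80, 25
cells = cols * rows                  # 2000 character cells
reads_per_cell = 1 + 16              # char/attribute fetch + 16 font scanlines (guess)
other_cycles_per_cell = 50           # loop and addressing overhead (guess)

cycles_per_frame = cells * (reads_per_cell * CYCLES_PER_READ + other_cycles_per_cell)
fps = CLOCK_HZ / cycles_per_frame
print(f"{cycles_per_frame} cycles/frame -> {fps:.1f} full conversions/sec")
```

Even with these charitable guesses the read latency dominates the cost, which is the worry raised above: replace the guessed counts with real instruction counts from the actual conversion loop and the same arithmetic yields the real framerate.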
