On 5/26/05, Viktor Pracht <[EMAIL PROTECTED]> wrote:
> On Wednesday, 25.05.2005 at 20:32 -0400, Timothy Miller wrote:
> 
> > I've been thinking about it, and while I really like the idea of
> > instructions being lookup tables in RAM, it may not give us the
> > performance we need.  Things will already be slow.  So, I suggest we
> > develop a simple processor and use an FPGA RAM block to store both
> > nearly 500 instructions and the register file.
> 
> That "may not give" is not enough. I want real numbers to make it either
> "will give" or "won't give".
> 
> The performance of the nanocontroller is adequate in all cases except
> where a single VGA operation potentially affects the whole framebuffer
> (changing the palette, changing the font etc.), or in text mode, where a
> single write changes up to 1 KB of framebuffer but is expected to be
> very fast. These cases are simply a lot of memory copying, with an
> additional memory read in between (to perform computations on the data).
> That becomes six cached instructions, a couple of cached LUTs, and two
> parallel, very predictable access patterns.
> 
> Since the 3D pipeline is supposed to be able to redraw the whole screen
> at much higher resolutions and framerates than VGA, the memory bandwidth
> can't be the bottleneck. The question now is, what does the cache look
> like, and how can the nanocontroller be designed to use it optimally?
> The ideal case is indeed when the nanocontroller code is inside an FPGA
> RAM block, but it's best when that block is part of the normal cache and
> isn't wasted in non-VGA mode. (And that's true for any kind of VGA
> processor.)
> 
> PS:  Don't worry about the idea of custom instructions. It's nothing
> more than a memory read with indirect addressing. Any processor that is
> capable of looking up colors in the DAC palette is automatically capable
> of that.
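
Your point about indirect addressing can be sketched in a few lines of
Python (the table contents and the example operation are made up, just
to show the mechanism):

```python
# Minimal sketch of "custom instructions as lookup tables": an opcode is
# just an index into a table in RAM, exactly like looking a pixel value
# up in the DAC palette.  Table contents here are hypothetical examples.

# A palette lookup: one memory read with indirect addressing.
palette = [0x000000, 0xAA0000, 0x00AA00, 0xAAAA00]  # example entries

def palette_lookup(index):
    return palette[index]

# A "custom instruction" works the same way: the operand indexes a
# result table stored in RAM.  Here the table implements a byte-wise NOT,
# purely as an illustration.
custom_op_table = [x ^ 0xFF for x in range(256)]

def execute_custom_op(operand):
    return custom_op_table[operand]  # same mechanism as the palette read

print(hex(palette_lookup(1)))        # indirect read through the palette
print(hex(execute_custom_op(0x0F)))  # indirect read through the op table
```

Any datapath that can do the first read can do the second; the only
difference is which table the address points into.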


Ok, the way reads work in this memory controller, it's designed for
throughput, not latency.  So, streaming reads will be efficient, but
atomic reads will have a latency of AT LEAST 20 clock cycles.  In the
3D pipeline, the places where this matters have FIFOs to absorb the
latency.  But in the nanocontroller, accesses are mostly atomic, and
there's very little you can do to absorb the latency.  Also, since the
memory controller and the nanoprocessor run at different clock rates,
there's additional latency in the cross-domain synchronization.  So I
figure you'll have a delay of roughly 20 cycles in the 100MHz domain
for the processor for ANY memory read.  While instructions can be
cached, to an extent, the lookup tables cannot be, because the
accesses are totally random.  If we pipeline it properly, that's 20
cycles per instruction, unless the instruction requires another
memory read, in which case it's another 20.  (Writes can be ignored.)

Now, imagine the sort of program that has to be written to convert
80x25 text mode to graphics.  There are loops and lots of memory reads
and all sorts of stuff.  The throughput is going to be horrible.  If we
can do some estimates on instruction count, we can come up with a
framerate.
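
Here's a rough first cut at that estimate.  The 20-cycle read cost and
the 100MHz clock are from above; the per-character instruction and read
counts are guesses I made up (one char/attribute fetch plus one font
byte per scanline of an assumed 8x16 glyph, with a couple instructions
of loop overhead around each read), so treat the output as illustrative
only:

```python
# Rough throughput estimate for converting 80x25 text mode to graphics
# on the nanocontroller.  The 20-cycle latency and 100MHz clock come
# from the discussion above; the per-character counts are guesses.

CLOCK_HZ = 100_000_000       # nanoprocessor clock (100MHz domain)
CYCLES_PER_INSN = 20         # pipelined cost of one instruction
CYCLES_PER_EXTRA_READ = 20   # each additional memory read costs 20 more

CHARS = 80 * 25              # characters per text-mode screen

# Guess: fetch the char/attribute word, then one font byte per scanline
# of an assumed 8x16 glyph.
READS_PER_CHAR = 1 + 16
# Guess: ~2 instructions of loop overhead per read; writes ignored.
INSNS_PER_CHAR = READS_PER_CHAR * 2

cycles_per_char = (INSNS_PER_CHAR * CYCLES_PER_INSN
                   + READS_PER_CHAR * CYCLES_PER_EXTRA_READ)
cycles_per_frame = CHARS * cycles_per_char
seconds_per_frame = cycles_per_frame / CLOCK_HZ
fps = 1.0 / seconds_per_frame

print(f"{cycles_per_frame} cycles/frame, "
      f"{seconds_per_frame * 1000:.1f} ms, "
      f"{fps:.0f} full redraws/sec")
```

With those (invented) counts it comes out to a few tens of full-screen
conversions per second, which is why pinning down the real instruction
counts matters before we commit to anything.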

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
