Ok, here are some assumptions I made:
Load/store architecture.
Unified instructions, data, and registers. In other words, the 512 memory
locations contain code, data, and registers.
No multiply or divide instructions.
Every two characters requires the following:
One card memory read to load the two characters and attributes
Two card memory reads to load the foreground attribute RGBA
Two card memory reads to load the background attribute RGBA
144 memory writes to write the font bitmaps to the frame buffer
That works out to 149 memory accesses per two characters or 149000 per
80x25 screen. That's around 582k read or written per screen update.
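The tally above can be re-derived with a quick sketch. The access counts are the ones listed; the 4-byte access width is an assumption on my part (it is not stated above, but it is the width that makes 149000 accesses come out to roughly 582k). The function names are just for illustration.

```python
# Accesses needed to render one pair of characters (figures from the list above).
CHAR_ATTR_READS = 1       # two characters plus their attributes
FOREGROUND_READS = 2      # foreground attribute RGBA
BACKGROUND_READS = 2      # background attribute RGBA
FRAMEBUFFER_WRITES = 144  # font bitmap writes to the frame buffer

COLUMNS, ROWS = 80, 25
BYTES_PER_ACCESS = 4      # ASSUMPTION: 32-bit accesses, not stated in the post


def accesses_per_pair():
    """Memory accesses for one pair of characters."""
    return (CHAR_ATTR_READS + FOREGROUND_READS
            + BACKGROUND_READS + FRAMEBUFFER_WRITES)


def accesses_per_screen(columns=COLUMNS, rows=ROWS):
    """Memory accesses for a full text screen, two characters at a time."""
    pairs = (columns * rows) // 2
    return pairs * accesses_per_pair()


def kib_per_screen(columns=COLUMNS, rows=ROWS):
    """Data moved per screen update in KiB, under the assumed access width."""
    return accesses_per_screen(columns, rows) * BYTES_PER_ACCESS / 1024


if __name__ == "__main__":
    print(accesses_per_pair())       # 149
    print(accesses_per_screen())     # 149000
    print(round(kib_per_screen()))   # 582
```

Passing columns=80, rows=50 gives the corresponding figure for the 80x50 case, assuming the per-pair cost stays the same.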
The rudimentary program I wrote uses 22 registers (I was not exactly
register efficient) and takes 62 instructions for a total of 84 memory
locations. I figure adding in flexibility in screen resolution (e.g.
80x50), blinking, and such may double the instruction count.
I'll post the program after I get a chance to read back through it and
make sure I didn't do anything blatantly boneheaded (which I'm pretty
sure I did).
Looking at the instruction count, I think we can certainly use one of
the FPGA 512x36 RAM blocks for the nanocontroller. Something we should
think about: once we move into the ASIC realm a couple of revs down the
line, and real estate isn't such a big deal, we should consider expanding
the nanocontroller functionality so that it can operate as a
programmable vertex shader.
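As a rough check on the RAM block budget, here is a small sketch using the counts above (22 registers, 62 instructions, a 512-word block). It assumes the register count stays at 22 when the instruction count doubles, which is a guess; the names are illustrative only.

```python
RAM_BLOCK_WORDS = 512  # one FPGA 512x36 RAM block
REGISTERS = 22         # registers used by the rudimentary program
INSTRUCTIONS = 62      # instructions in the rudimentary program


def locations_used(instructions, registers=REGISTERS):
    """Memory locations taken by code plus registers in the unified space."""
    return instructions + registers


def words_free(instructions, registers=REGISTERS):
    """Locations left over in a single RAM block."""
    return RAM_BLOCK_WORDS - locations_used(instructions, registers)


if __name__ == "__main__":
    print(locations_used(INSTRUCTIONS))      # 84
    print(words_free(INSTRUCTIONS))          # 428
    # ASSUMPTION: doubling the instructions leaves the register count unchanged.
    print(locations_used(2 * INSTRUCTIONS))  # 146
    print(words_free(2 * INSTRUCTIONS))      # 366
```

Even with the instruction count doubled, that leaves well over half the block for lookup data or further features.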
Patrick M
Timothy Miller wrote:
On 5/26/05, Viktor Pracht <[EMAIL PROTECTED]> wrote:
On Wednesday, 25.05.2005, at 20:32 -0400, Timothy Miller wrote:
I've been thinking about it, and while I really like the idea of
instructions being lookup tables in RAM, it may not give us the
performance we need. Things will already be slow. So, I suggest we
develop a simple processor and use an FPGA RAM block to store both
nearly 500 instructions and the register file.
That "may not give" is not enough. I want real numbers to make it either
"will give" or "won't give".
The performance of the nanocontroller is adequate in all cases except
where a single VGA operation potentially affects the whole framebuffer
(changing the palette, changing the font etc.), or in text mode, where a
single write changes up to 1 KB of framebuffer but is expected to be
very fast. These cases are simply a lot of memory copying, with an
additional memory read in between (to perform computations on the data).
That becomes six cached instructions, a couple of cached LUTs, and two
parallel, very predictable access patterns.
Since the 3D pipeline is supposed to be able to redraw the whole screen
at much higher resolutions and framerates than VGA, the memory bandwidth
can't be the bottleneck. The question now is, what does the cache look
like, and how can the nanocontroller be designed to use it optimally?
The ideal case is indeed when the nanocontroller code is inside an FPGA
RAM block, but it's best when that block is part of the normal cache and
isn't wasted in non-VGA mode. (And that's true for any kind of VGA
processor.)
PS: Don't worry about the idea of custom instructions. It's nothing
more than a memory read with indirect addressing. Any processor that is
capable of looking up colors in the DAC palette is automatically capable
of that.
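To illustrate the point about indirect addressing, here is a minimal sketch in which a palette lookup and a "custom instruction" are the same operation: a read at base + index. The memory layout, the base addresses, the palette values, and the example custom operation are all hypothetical and chosen only for the demonstration.

```python
MEMORY_WORDS = 512
PALETTE_BASE = 0x100  # hypothetical location of a small palette
SWAP_BASE = 0x110     # hypothetical location of a 16-entry "instruction" table


def indirect_read(memory, base, index):
    """Read memory[base + index]; the only primitive either use needs."""
    return memory[base + index]


def make_memory():
    """Build a memory image holding a palette and one table-driven operation."""
    memory = [0] * MEMORY_WORDS
    # Example palette entries (arbitrary RGB values for the demonstration).
    for i, rgb in enumerate([0x000000, 0x0000AA, 0x00AA00, 0x00AAAA]):
        memory[PALETTE_BASE + i] = rgb
    # Example custom operation: swap the two 2-bit halves of a 4-bit operand.
    for operand in range(16):
        memory[SWAP_BASE + operand] = ((operand & 0x3) << 2) | (operand >> 2)
    return memory


if __name__ == "__main__":
    mem = make_memory()
    print(hex(indirect_read(mem, PALETTE_BASE, 2)))   # palette lookup
    print(bin(indirect_read(mem, SWAP_BASE, 0b0111))) # table-driven operation
```

Both calls go through the same indirect_read; only the table contents differ.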
Ok, the way reads work in this memory controller, it's designed for
throughput and not latency. So, streaming reads will be efficient,
but atomic reads will have a latency of AT LEAST 20 clock cycles. In
the 3D pipeline, the places where this matters have fifos to absorb
the latency. But in the nanocontroller, it's mostly atomic, and
there's very little you can do to absorb the latency. Also, since the
memory controller and the nanoprocessor run at different clock rates,
there's additional latency in the cross-domain synchronization. So, I
figure you'll have a delay of roughly 20 cycles in the 100MHz domain
for the processor for ANY memory read. While instructions can be
cached, to an extent, the lookup tables cannot be, because the
accesses are totally random. If we pipeline it properly, that's 20
cycles per instruction, unless the instruction indicates another
memory read, in which case, it's another 20. (Writes can be ignored.)
Now, imagine the sort of program that has to be written to convert an
80x25 text mode to graphics. There are loops and lots of memory reads
and all sorts of stuff. The throughput's going to be horrible. If we
can do some estimates on instruction count, we can come up with a
framerate.
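Here is a sketch of how such an estimate could be turned into a framerate, using the model described above: 20 cycles at 100MHz per instruction, plus another 20 for each instruction that triggers a further memory read, with writes ignored. The per-pair instruction and read counts in the example are placeholders, not measurements, and the model ignores any instruction caching.

```python
CLOCK_HZ = 100_000_000    # processor clock domain from the text above
READ_LATENCY_CYCLES = 20  # minimum latency of one atomic memory read


def cycles_per_screen(instructions_per_pair, data_reads_per_pair,
                      columns=80, rows=25):
    """Cycles to redraw a text screen, processing two characters at a time.

    Every instruction costs one read latency (its fetch); every data read
    costs one more. Writes are treated as free.
    """
    pairs = (columns * rows) // 2
    reads_per_pair = instructions_per_pair + data_reads_per_pair
    return pairs * reads_per_pair * READ_LATENCY_CYCLES


def frames_per_second(instructions_per_pair, data_reads_per_pair,
                      columns=80, rows=25):
    """Full-screen redraws per second under the model above."""
    return CLOCK_HZ / cycles_per_screen(
        instructions_per_pair, data_reads_per_pair, columns, rows)


if __name__ == "__main__":
    # HYPOTHETICAL workload: 200 instructions and 5 data reads per pair.
    print(cycles_per_screen(200, 5))            # 4100000
    print(round(frames_per_second(200, 5), 2))  # 24.39
```

Plugging in real instruction counts once a program exists would replace the placeholder numbers.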
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)