On 9 Dec 2007, at 22:57, Timothy Normand Miller wrote:
On Dec 9, 2007 3:54 PM, Michael Meeuwisse <[EMAIL PROTECTED]> wrote:
When it comes to reads, it may or may not help to have that batch-size
metadata. At SOME point, we have to break things into individual word
requests because that's how the memories and memory controllers work.
Yes. I'm arguing that it's better to do this in the arbiter than in
the agent, because it's easier at a later point to act 'smart'. I
think that figuring out that a dozen addresses can be grouped
together in a single read on a memory controller is much more
expensive than deciding to ungroup a requested block in multiple
reads if it needs to.
I've had other people argue the same point. I'd like more analysis,
but I think I'll go ahead and cave on this one. If the counters are
"in" the arbiter, then their status is available (in a more direct
way) to the scheduler. As long as the read bursts are limited in
length, we can take advantage of that to finish a block before
switching, say, to a higher-priority agent. This will reduce row
misses.
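
To make the "ungroup a block in the arbiter" idea concrete, here is a
minimal Python sketch of splitting one block request into
length-limited bursts. The function name and the max_burst and
row_words parameters are hypothetical, and stopping each burst at a
row boundary is my own assumption about how the row-miss concern
would be handled; none of this is taken from the actual arbiter
design.

```python
def split_into_bursts(start_addr, word_count, max_burst, row_words):
    """Split a block read into (address, length) bursts.

    Each burst is at most max_burst words long and never crosses a
    memory row boundary (rows are assumed to be row_words words long).
    """
    bursts = []
    addr = start_addr
    remaining = word_count
    while remaining > 0:
        # Words left before the next row boundary.
        words_left_in_row = row_words - (addr % row_words)
        length = min(remaining, max_burst, words_left_in_row)
        bursts.append((addr, length))
        addr += length
        remaining -= length
    return bursts


if __name__ == "__main__":
    # A 10-word block with bursts limited to 4 words.
    print(split_into_bursts(0, 10, 4, 1024))
```

Because the arbiter holds the counter, it can stop between any two
bursts and hand the memory to a higher-priority agent, then resume
where it left off.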
*snip*
What would be really valuable is some simulation or gedanken
experiment that would give us an idea of the probability of two
different agents trying to access the same memory row.
*snip*
I guess.
My idea was to send out a request for one fifo the moment it runs out
of data and another fifo starts supplying data. The arbiter will have
time for as long as the other fifo can provide data. We can put the
address we want (and the block size somehow, say, another queue) in a
queue from the arbiter, and the arbiter can write data back to us as
if we were a fifo. Internally, we'd pass it on to the correct fifo
(this is all in the arbiter's clock domain).
I don't see a reason to have more than one return data fifo. If you
want to detect when the fifo is low, we can just look at how many
entries are in it and decide when to start reading again and fill up
the fifo you're already reading from. They're dual ported; you can
read from the fifo and write to it at the same time. They're circular
queues.
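
As a rough software model of the single circular queue described
above, the sketch below keeps a fill count and reports when the level
has dropped to a refill threshold. The class name, the low_water
parameter, and the needs_refill method are hypothetical names for
illustration only. A real dual-ported block RAM fifo can read and
write in the same clock; this model only shows the pointer and count
bookkeeping.

```python
class CircularFifo:
    """Software model of a fixed-depth circular queue."""

    def __init__(self, depth, low_water):
        self.buf = [None] * depth
        self.depth = depth
        self.low_water = low_water  # refill threshold (assumed)
        self.rd = 0                 # read pointer
        self.wr = 0                 # write pointer
        self.count = 0              # entries currently stored

    def push(self, word):
        """Write one word; returns False if the queue is full."""
        if self.count == self.depth:
            return False
        self.buf[self.wr] = word
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def pop(self):
        """Read one word; returns None if the queue is empty."""
        if self.count == 0:
            return None
        word = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return word

    def needs_refill(self):
        """True once the fill level is at or below the threshold."""
        return self.count <= self.low_water
```

The reader keeps popping from the same queue while the writer tops it
up, so no second fifo is needed to hide the refill latency.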
Hmm, excellent point. For some reason I got in my head that two fifos
are easier to manage, and stuck to that.
However, the architecture of our video controller wouldn't allow this
anyhow. Video timing and data fetch are controlled by a continuously
running program. It's another sort of microcontroller, although it
has special loop constructs and no conditional flow control. Read the
docs and you'll see. If we were to separate the "fetch" program from
the "send" program, we could do something akin to what you're
suggesting. However, this video controller has been in use and works
very well. We ensure that video data is available at the right time
by requesting it far enough in advance and giving its requests the
appropriate priority.
I've read most of it by now, and suddenly feel at a loss why we're
even having this discussion. That stuff is great and does a lot of
the management I wanted to do in this video fifo. It also clarifies
why you keep going on about the arbiter, there's not much else to
talk about otherwise. :)
Anyway, if we can define how many entries we want from a certain
address (I mean the FETCH instruction) we can do fewer fetches in the
video program. Although I'm not sure if that even matters.
The tricky part is that the fifos will not be very big. There's only
216KB of block RAM available, so say that we take two blocks of
18 Kbit for each fifo. In our highest target resolution (2048 * 1600 *
24, 60Hz) the raster scanner will work through 160,000 full fifos per
second. Getting these all filled in time will become quite a strain on
the arbiter.
Not really.
For requests, instead of a fifo, we'll just have four address counters
in the arbiter. The addresses are filled by requests that come from
the video controller, and they have second priority (top is DRAM
refresh). Their returns go into four 64-bit wide fifos. The combined
256-bit-wide queue will require 8 of our 96 (?) block RAM modules.
The 512 entries times 256 bits means that the queues can hold up to
4096 pixels, which is longer than any scanline we'll want to scan out.
Done right, we can make this work with even longer ones.
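
The capacity figures above can be checked with a few lines of Python.
Two values here are assumptions rather than anything stated in the
thread: that each pixel occupies 32 bits in memory (which is what
makes 4096 come out), and that each block RAM is used in a 512 x 36
configuration with 32 data bits.

```python
# Figures taken from the message above.
ENTRIES_PER_FIFO = 512
FIFO_WIDTH_BITS = 64
NUM_FIFOS = 4

# Assumptions, not stated in the thread.
BITS_PER_STORED_PIXEL = 32    # assumed storage size of one pixel
DATA_BITS_PER_BLOCK_RAM = 32  # assumed 512 x 36 mode, 32 data bits


def queue_width_bits():
    """Width of the combined return queue in bits."""
    return NUM_FIFOS * FIFO_WIDTH_BITS


def capacity_pixels():
    """Number of pixels the combined queue can hold."""
    total_bits = ENTRIES_PER_FIFO * queue_width_bits()
    return total_bits // BITS_PER_STORED_PIXEL


def block_rams_needed():
    """Block RAMs needed to build the combined queue width."""
    return queue_width_bits() // DATA_BITS_PER_BLOCK_RAM


if __name__ == "__main__":
    print(queue_width_bits(), capacity_pixels(), block_rams_needed())
    # prints: 256 4096 8
```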
So all the video fifo has to do is be a 256-bit-wide queue? The
video controller will know when 1/xth is used up and can be refilled;
it'll send a fetch at that point. The arbiter will work through the
addresses received in order.
Why four address counters btw?
I'm not sure how (if at all) this differs from my description. The
only point I'm making is that the queue the data sits in is in fact
part of the agent. The addresses going to the memory controllers are
all arbiter business; the arbiter sits between us agents and the
controllers. When the data comes back, the arbiter has kept track of
what address was associated with it and passes it on to the relevant
agent. Interesting, but not relevant for the video fifo. :)
This is how it works. Note that the way we identify whom a read
belongs to is by tags that travel through the fifo. Each reader is
given a number. When they make a request, their tag number follows
the command through the memory controller pipeline. When the data
comes out, we sync it with the tag. The arbiter (well, some simple
piece of logic anyhow) uses the tag number to determine whose return
queue to put the data word into.
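
A small Python model may help picture the tag scheme. It assumes an
in-order pipeline and uses hypothetical names (TaggedReadPath,
request, complete_one); the real logic lives in hardware and may
differ in detail.

```python
from collections import deque


class TaggedReadPath:
    """Model of read requests whose tags travel with them."""

    def __init__(self, num_readers):
        # (tag, address) pairs currently in the memory pipeline.
        self.pipeline = deque()
        # One return queue per reader, indexed by tag number.
        self.return_queues = [deque() for _ in range(num_readers)]

    def request(self, tag, addr):
        """A reader issues a read; its tag follows the address."""
        self.pipeline.append((tag, addr))

    def complete_one(self, memory):
        """Retire the oldest read and route its data by tag."""
        if not self.pipeline:
            return False
        tag, addr = self.pipeline.popleft()
        self.return_queues[tag].append(memory[addr])
        return True


if __name__ == "__main__":
    memory = {0x10: "a", 0x20: "b", 0x30: "c"}
    path = TaggedReadPath(num_readers=2)
    path.request(0, 0x10)
    path.request(1, 0x20)
    path.request(0, 0x30)
    while path.complete_one(memory):
        pass
    print([list(q) for q in path.return_queues])
```

Note that the routing step never looks at the address, only at the
tag, which is why the return side can stay very simple.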
Read requests are, effectively or literally, made by putting addresses
into one queue. The data comes back through another.
Agreed. Does this queue have data relating to the number of bits we
want from that address? Or will we make another queue for that? Or is
it predefined (which is nasty, as I tried to explain earlier).
Tell me if the "tags" thing above doesn't answer your question.
I really meant a tag to let the arbiter know that we want x bytes
from an associated address. 256 bytes, 512, a few KB, etc.
Note that there are no tristates inside of an FPGA. (Well, there
could hypothetically be, but we never use them.)
You mean my inout usage?
Yes. Inouts are not synthesizable except for external pins.
Oh. Oops. :)
A final thing to add, I mentioned sending a signal a cycle early. I
essentially meant the 'empty' from the fifo, only a clock early. This
way we can switch between the fifos driving the bus to the raster
scanner without the raster scanner ever knowing.
I'm not quite following. Explain to me why you think we need more
than one queue for video data. Maybe that'll help.
Really because it's how I'd do it if I had to write it in C. Which
makes no sense in hardware.
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
Mike
www.projectvga.org