On Dec 9, 2007 3:54 PM, Michael Meeuwisse <[EMAIL PROTECTED]> wrote:
> > When it comes to reads, it may or may not help to have that batch-size
> > metadata. At SOME point, we have to break things into individual word
> > requests because that's how the memories and memory controllers work.
>
> Yes. I'm arguing that it's better to do this in the arbiter than in
> the agent, because it's easier at a later point to act 'smart'. I
> think that figuring out that a dozen addresses can be grouped
> together in a single read on a memory controller is much more
> expensive than deciding to ungroup a requested block into multiple
> reads if it needs to.
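The "ungroup in the arbiter" idea can be sketched in a few lines; the function name and the burst cap are illustrative assumptions of mine, not part of the design. Capping burst length is what later allows the scheduler to preempt between bursts:

```python
# Illustrative sketch: the arbiter "ungroups" a requested block into
# individual word reads, since the memory controllers only understand
# single-word accesses. The burst_limit parameter is an assumption;
# a bounded burst lets the scheduler switch agents between bursts.

def ungroup_block(base_addr, word_count, burst_limit=8):
    """Split a block request into bursts of single-word reads."""
    words = [base_addr + i for i in range(word_count)]
    # Chop the word list into bursts of at most burst_limit reads each.
    return [words[i:i + burst_limit] for i in range(0, len(words), burst_limit)]
```

For example, a 12-word block starting at 0x1000 comes out as one burst of 8 reads followed by one of 4, so a higher-priority agent could be serviced after the first burst.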
I've had other people argue the same point. I'd like more analysis, but I think I'll go ahead and cave on this one. If the counters are "in" the arbiter, then their status is available (in a more direct way) to the scheduler. As long as the read bursts are limited in length, we can take advantage of that to finish a block before switching, say, to a higher-priority agent. This will reduce row misses. We don't have this metadata available for writes, so we have to make inferences.

Additionally, if a read crosses a row boundary, we might want to pause the burst and switch to another agent, if we can be (reasonably) sure that they want to access the same row. Detecting row hits between different agents is something that we don't always have to get right. We just have to get it right _most_ of the time.

The information about which rows are open is buried deep within the memory controller, and since the controller is a pipeline, that information exists at a different time offset, so we can't use it directly. Instead, the arbiter will have to keep its own shadow of that information. There are four memory controllers, and each memory can have a different row open in each of four banks. That's 16 different row addresses to track. What I think might give us a good heuristic is to keep track of the bottom four bits (or some simple four-bit hash) of each of those sixteen rows. Which controller and which bank are deterministic (the bottom two and top two address bits, respectively), so we can just look up the 4-bit row indicator and compare it to the same four bits of each address being requested. If one matches, that request gets preference.

What would be really valuable is some simulation or gedanken experiment that would give us an idea of the probability of two different agents trying to access the same memory row. There ARE cases where we read an address and then later write back to exactly that same address. Alpha blending is one example.
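To make the shadow-tracking heuristic concrete, here is a minimal sketch. The controller coming from the bottom two address bits and the bank from the top two are as described above; the bit position of the row field (and the address width) are my assumptions for illustration only:

```python
# Sketch of the arbiter's "shadow" of open rows: for each of the
# 4 controllers x 4 banks (16 entries), remember the bottom four bits
# of the last row opened there, and prefer requests whose tag matches.

ADDR_BITS = 26        # assumed address width
ROW_SHIFT = 12        # assumed position of the row field within the address

def controller(addr):
    return addr & 0x3                        # bottom two address bits

def bank(addr):
    return (addr >> (ADDR_BITS - 2)) & 0x3   # top two address bits

def row_tag(addr):
    # Keep only the bottom four bits of the row as the cheap indicator.
    return (addr >> ROW_SHIFT) & 0xF

shadow = {}  # (controller, bank) -> 4-bit row tag, at most 16 entries

def note_access(addr):
    """Update the shadow when the arbiter dispatches an access."""
    shadow[(controller(addr), bank(addr))] = row_tag(addr)

def likely_row_hit(addr):
    """True if this request probably targets an already-open row."""
    return shadow.get((controller(addr), bank(addr))) == row_tag(addr)
```

A 4-bit tag will occasionally report a false match when two different rows share their bottom four bits, but as noted above, the heuristic only has to be right most of the time.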
However, the time between when a given address is read and when it's written to could be so long that by the time we get the corresponding write, the reads have already long since crossed a row boundary. The only way around that would be to insert artificial delays in anticipation of writes corresponding to reads that had just occurred. But do we want to intentionally make the memory controller idle like that?

> My idea was to send out a request for one fifo the moment it runs out
> of data and another fifo starts supplying data. The arbiter will have
> time for as long as the other fifo can provide data. We can put the
> address we want (and the block size somehow, say, another queue) in a
> queue from the arbiter, and the arbiter can write data back to us as
> if we were a fifo. Internally, we'd pass it on to the correct fifo
> (this is all in the arbiter's clock domain).

I don't see a reason to have more than one return data fifo. If you want to detect when the fifo is low, we can just look at how many entries are in it and decide when to start reading again to fill up the fifo you're already reading from. The fifos are dual ported: you can read from a fifo and write to it at the same time. They're circular queues.

However, the architecture of our video controller wouldn't allow this anyhow. Video timing and data fetch are controlled by a continuously running program. It's another sort of microcontroller, although it has special loop constructs and no conditional flow control. Read the docs and you'll see. If we were to separate the "fetch" program from the "send" program, we could do something akin to what you're suggesting. However, this video controller has been in use and works very well. We ensure that video data is available at the right time by requesting it far enough in advance and giving its requests the appropriate priority.

> The tricky part is that the fifos will not be very big.
> There's only 216KB of block ram available, so say that we take two
> blocks of 18Kbit for each fifo. In our highest target resolution
> (2048 * 1600 * 24, 60Hz) the raster scanner will work through
> 160,000 full fifos per second. To get these all filled in time will
> become quite a strain on the arbiter.

Not really. For requests, instead of a fifo, we'll just have four address counters in the arbiter. The counters are filled by requests that come from the video controller, and those requests have second priority (top priority is DRAM refresh). The returns go into four 64-bit-wide fifos. The combined 256-bit-wide queue will require 8 of our 96 (?) block RAM modules. At 512 entries times 256 bits, the queues can hold up to 4096 pixels, which is longer than any scanline we'll want to scan out. Done right, we can make this work with even longer ones.

> I'm not sure how (if at all) this differs from my description. The
> only point I'm making is that the queue the data sits in is, in fact,
> part of the agent. The addresses going to the memory controllers are
> all arbiter talk, which sits between us agents and the controllers.
> When the data comes back, the arbiter has kept track of what address
> was associated with this data and passes it on to the relevant agent.

Interesting, but not relevant for the video fifo. :) This is how it works. Note that the way we identify which agent a read belongs to is by tags that travel through the fifo. Each reader is given a number. When a reader makes a request, its tag number follows the command through the memory controller pipeline. When the data comes out, we sync it up with the tag. The arbiter (well, some simple piece of logic anyhow) uses the tag number to determine whose return queue to put the data word into.

> > Read requests are, effectively or literally, made by putting addresses
> > into one queue. The data comes back through another.
>
> Agreed.
> Does this queue have data relating to the number of bits we
> want from that address? Or will we make another queue for that? Or is
> it predefined (which is nasty, as I tried to explain earlier)?

Tell me if the "tags" thing above doesn't answer your question.

> > Note that there are no tristates inside of an FPGA. (Well, there
> > could hypothetically be, but we never use them.)
>
> You mean my inout usage?

Yes. Inouts are not synthesizable except for external pins.

> A final thing to add: I mentioned sending a signal a cycle early. I
> essentially meant the 'empty' from the fifo, only a clock early. This
> way we can switch between the fifos driving the bus to the raster
> scanner without the raster scanner ever knowing.

I'm not quite following. Explain to me why you think we need more than one queue for video data. Maybe that'll help.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
