On Dec 9, 2007 3:54 PM, Michael Meeuwisse <[EMAIL PROTECTED]> wrote:
> > When it comes to reads, it may or may not help to have that batch-size
> > metadata. At SOME point, we have to break things into individual word
> > requests because that's how the memories and memory controllers work.
>
> Yes. I'm arguing that it's better to do this in the arbiter than in
> the agent, because it's easier at a later point to act 'smart'. I
> think that figuring out that a dozen addresses can be grouped
> together in a single read on a memory controller is much more
> expensive than deciding to ungroup a requested block into multiple
> reads if it needs to.
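The "ungroup in the arbiter" idea can be sketched in a few lines; the function name and the burst cap are illustrative assumptions of mine, not part of the design. Capping burst length is what later allows the scheduler to preempt between bursts:

```python
# Illustrative sketch: the arbiter "ungroups" a requested block into
# individual word reads, since the memory controllers only understand
# single-word accesses. The burst_limit parameter is an assumption;
# a bounded burst lets the scheduler switch agents between bursts.

def ungroup_block(base_addr, word_count, burst_limit=8):
    """Split a block request into bursts of single-word reads."""
    words = [base_addr + i for i in range(word_count)]
    # Chop the word list into bursts of at most burst_limit reads each.
    return [words[i:i + burst_limit] for i in range(0, len(words), burst_limit)]
```

For example, a 12-word block starting at 0x1000 comes out as one burst of 8 reads followed by one of 4, so a higher-priority agent could be serviced after the first burst.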
I've had other people argue the same point. I'd like more analysis, but I think I'll go ahead and cave on this one. If the counters are "in" the arbiter, then their status is available (in a more direct way) to the scheduler. As long as the read bursts are limited in length, we can take advantage of that to finish a block before switching, say, to a higher-priority agent. This will reduce row misses. We don't have this metadata available for writes, so we have to make inferences.

Additionally, if a read crosses a row boundary, we might want to pause the burst and switch to another agent, if we can be (reasonably) sure that they want to access the same row. Detecting row hits between different agents is something that we don't always have to get right. We just have to get it right _most_ of the time.

The information about which rows are open is buried deep within the memory controller, and since the controller is a pipeline, that information exists at a different time offset, so we can't use it directly. Instead, the arbiter will have to keep its own shadow of that information. There are four memory controllers, and each memory can have a different row open in each of four banks. That's 16 different row addresses to track. What I think might give us a good heuristic is to keep track of the bottom four bits (or some simple four-bit hash) of each of those sixteen rows. Which controller and which bank are deterministic (the bottom two and top two address bits, respectively), so we can just look up the 4-bit row indicator and compare it to the same four bits of each address being requested. If one matches, that request gets preference.

What would be really valuable is some simulation or gedanken experiment that would give us an idea of the probability of two different agents trying to access the same memory row. There ARE cases where we read an address and then later write back to exactly that same address. Alpha blending is one example.
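To make the shadow-tracking heuristic concrete, here is a minimal sketch. The controller coming from the bottom two address bits and the bank from the top two are as described above; the bit position of the row field (and the address width) are my assumptions for illustration only:

```python
# Sketch of the arbiter's "shadow" of open rows: for each of the
# 4 controllers x 4 banks (16 entries), remember the bottom four bits
# of the last row opened there, and prefer requests whose tag matches.

ADDR_BITS = 26        # assumed address width
ROW_SHIFT = 12        # assumed position of the row field within the address

def controller(addr):
    return addr & 0x3                        # bottom two address bits

def bank(addr):
    return (addr >> (ADDR_BITS - 2)) & 0x3   # top two address bits

def row_tag(addr):
    # Keep only the bottom four bits of the row as the cheap indicator.
    return (addr >> ROW_SHIFT) & 0xF

shadow = {}  # (controller, bank) -> 4-bit row tag, at most 16 entries

def note_access(addr):
    """Update the shadow when the arbiter dispatches an access."""
    shadow[(controller(addr), bank(addr))] = row_tag(addr)

def likely_row_hit(addr):
    """True if this request probably targets an already-open row."""
    return shadow.get((controller(addr), bank(addr))) == row_tag(addr)
```

A 4-bit tag will occasionally report a false match when two different rows share their bottom four bits, but as noted above, the heuristic only has to be right most of the time.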
However, the time between when a given address is read and when it's written to could be so long that by the time we get the corresponding write, the reads have already long since crossed a row boundary. The only way around that would be to insert artificial delays in anticipation of writes corresponding to reads that had just occurred. But do we want to intentionally make the memory controller idle like that?

> My idea was to send out a request for one fifo the moment it runs out
> of data and another fifo starts supplying data. The arbiter will have
> time for as long as the other fifo can provide data. We can put the
> address we want (and the block size somehow, say, another queue) in a
> queue from the arbiter, and the arbiter can write data back to us as
> if we were a fifo. Internally, we'd pass it on to the correct fifo
> (this is all in the arbiter's clock domain).

I don't see a reason to have more than one return data fifo. If you want to detect when the fifo is low, we can just look at how many entries are in it and decide when to start reading again to fill up the fifo you're already reading from. The fifos are dual ported: you can read from a fifo and write to it at the same time. They're circular queues.

However, the architecture of our video controller wouldn't allow this anyhow. Video timing and data fetch are controlled by a continuously running program. It's another sort of microcontroller, although it has special loop constructs and no conditional flow control. Read the docs and you'll see. If we were to separate the "fetch" program from the "send" program, we could do something akin to what you're suggesting. However, this video controller has been in use and works very well. We ensure that video data is available at the right time by requesting it far enough in advance and giving its requests the appropriate priority.

> The tricky part is that the fifos will not be very big.
> There's only 216KB of block ram available, so say that we take two
> blocks of 18Kbit for each fifo. In our highest target resolution
> (2048 * 1600 * 24, 60Hz) the raster scanner will work through
> 160,000 full fifos per second. To get these all filled in time will
> become quite a strain on the arbiter.

Not really. For requests, instead of a fifo, we'll just have four address counters in the arbiter. The counters are filled by requests that come from the video controller, and those requests have second priority (top priority is DRAM refresh). The returns go into four 64-bit-wide fifos. The combined 256-bit-wide queue will require 8 of our 96 (?) block RAM modules. At 512 entries times 256 bits, the queues can hold up to 4096 pixels, which is longer than any scanline we'll want to scan out. Done right, we can make this work with even longer ones.

> I'm not sure how (if at all) this differs from my description. The
> only point I'm making is that the queue the data sits in is, in fact,
> part of the agent. The addresses going to the memory controllers are
> all arbiter talk, which sits between us agents and the controllers.
> When the data comes back, the arbiter has kept track of what address
> was associated with this data and passes it on to the relevant agent.

Interesting, but not relevant for the video fifo. :) This is how it works. Note that the way we identify which agent a read belongs to is by tags that travel through the fifo. Each reader is given a number. When a reader makes a request, its tag number follows the command through the memory controller pipeline. When the data comes out, we sync it up with the tag. The arbiter (well, some simple piece of logic anyhow) uses the tag number to determine whose return queue to put the data word into.

> > Read requests are, effectively or literally, made by putting addresses
> > into one queue. The data comes back through another.
>
> Agreed.
> Does this queue have data relating to the number of bits we
> want from that address? Or will we make another queue for that? Or is
> it predefined (which is nasty, as I tried to explain earlier)?

Tell me if the "tags" thing above doesn't answer your question.

> > Note that there are no tristates inside of an FPGA. (Well, there
> > could hypothetically be, but we never use them.)
>
> You mean my inout usage?

Yes. Inouts are not synthesizable except for external pins.

> A final thing to add: I mentioned sending a signal a cycle early. I
> essentially meant the 'empty' from the fifo, only a clock early. This
> way we can switch between the fifos driving the bus to the raster
> scanner without the raster scanner ever knowing.

I'm not quite following. Explain to me why you think we need more than one queue for video data. Maybe that'll help.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
