Jeff Hartmann wrote:
>> That's not quite what I meant. Imagine the user has 40MB of on-card
>> memory available for textures. If that user runs an application that
>> uses two 2048x2048x32bpp textures per frame (each one weighing in at
>> ~22MB), we will either have to use AGP memory for one of them, or (if
>> the card is PCI with no PCIGART) it will have to copy 44MB across the
>> PCI bus every single frame.
>
>>> 2. We consider the block or group of blocks as an entire "unit";
>>> everything is done on units, not individual pieces of the blocks.
>>> That prevents someone swapping out the first page of a group of
>>> textures and someone else having to wait for just that block to
>>> come back.
>>
>> I believe that the block should be the unit used. If each block has
>> a group ID (the IDs that you talk about below) and a sequence number,
>> we can do some very nice optimizations. Imagine a case where we have
>> two textures, each of which uses 51% of the available texture space.
>> Performance would DIE if we had to bring in an entire texture every
>> single time. With a little optimization we can bring in only 2% of
>> texture memory each frame instead of 102%.
>
> Just a slight comment here: once the memory has actually made it out
> of the AGP aperture, the cost of getting it back is the same no matter
> how big the page is. Course I really like the idea of storing sequence
> numbers; it gives room for lots of flexibility. And for situations
> where we haven't fully made it out of the AGP aperture, the cost of
> the blit is much smaller.
In reality, to fit the second texture in on-card memory, we only have to reclaim 4MB. We could then hit a steady state where 18MB of each texture never moves out of on-card memory, and only 8MB of texture needs to be copied in each frame. If we view a texture (or any object) as an ordered sequence of blocks instead of a monolithic lump, we can make that optimization. Based on what you have written below, I think we're in agreement that this is the way to go. :)
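Treating a texture as an ordered sequence of blocks could be captured with something like the following standalone C sketch. The names (`tex_block`, `mb_copied_per_frame`) are hypothetical, not from the DRI tree; the function just encodes the 40MB/22MB steady-state arithmetic above, generalized on the assumption that only the overflow past on-card capacity has to stream each frame:

```c
#include <stdint.h>

/* Hypothetical per-block descriptor: a texture (or any object) is an
 * ordered run of blocks sharing one group ID, distinguished by a
 * sequence number. */
struct tex_block {
    uint32_t group_id;   /* which texture owns this block */
    uint32_t seq;        /* position of the block within its group */
    unsigned in_card:1;  /* currently resident in on-card memory */
};

/* Steady-state arithmetic from the example above: with two textures of
 * tex_mb each and card_mb of on-card memory, only the overflow ever has
 * to move, and both textures stream through that window each frame. */
static unsigned mb_copied_per_frame(unsigned card_mb, unsigned tex_mb)
{
    unsigned overflow = (2 * tex_mb > card_mb) ? 2 * tex_mb - card_mb : 0;
    return 2 * overflow;  /* 2 * 4MB = 8MB for the 40MB / 22MB case */
}
```

So 18MB of each texture stays resident and only the 4MB overflow of each one is copied per frame, matching the 8MB figure above.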
[big snip]
>> It would take some experiments to prove it, but I believe that simple
>> LRU or MRU is always suboptimal, in both the theoretical and the
>> practical case. On their own, (a) and (b) are too simple; they won't
>> give us enough functionality to do all the things we want. The memory
>> bit-map has the advantage that we can pick the largest free block and
>> try to reclaim memory around that free block until it is large enough
>> to satisfy the request. After having tried to do that with the
>> existing linked-list approach, I can honestly say that a linked list
>> is a very poor data structure for that purpose. On the flip side, the
>> bit-map doesn't store much useful information about the allocated
>> blocks, which makes it difficult to select which memory to reclaim
>> when memory is very full. The problem is that BOTH of these
>> situations are important performance cases!
>
> Okay, I have done some thinking, and I think I have a pretty good
> solution. In the normal case we try to get a completely unused portion
> of memory using the bit-vector freelist. If none is available and we
> can't grow AGP memory for some reason, we fall back onto another data
> structure: a priority heap (a priority queue or binary heap, depending
> on the data-structures book you look at). That should have good
> performance and, as far as size is concerned, makes more sense than a
> linked list, I think. We could implement the heap as just an array of
> index values into the pagelist. We might see some performance drain
> from just about every operation on the heap being O(log n), but our
> heap should be of small enough height that this won't make much
> difference compared to a linked list. We should make it a min-heap,
> with age as the key, so that the top of the heap is always the texture
> block or group with the minimum age (longest since it was last used by
> the card). We want to avoid ever doing a search of the heap, so we
> store the index into the heap alongside the pagelist.
>
> When we rearrange something in the heap, we make sure we update those
> indices. Perhaps we want a selection other than longest-since-last-use,
> but I think that might give us reasonable performance. MRU would be
> trivial with a max-heap, so that sort of policy would also be easy to
> implement with this data structure. I know you get different answers
> depending on who you ask, but what's the replacement strategy you would
> recommend?
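A minimal sketch of what that min-heap might look like, assuming an array-backed heap of pagelist indices with a `heap_pos` back-pointer stored in each pagelist entry so the heap never has to be searched. All names here are illustrative, not from any actual DRI code:

```c
#include <stddef.h>

#define MAX_BLOCKS 64

/* Hypothetical pagelist entry: 'age' is the last-use stamp, 'heap_pos'
 * is this entry's current index in the heap array (-1 = not in heap). */
struct page_entry {
    unsigned age;
    int      heap_pos;
};

struct age_heap {
    int n;
    int slot[MAX_BLOCKS];       /* heap array of pagelist indices */
    struct page_entry *pages;   /* backing pagelist */
};

static void heap_swap(struct age_heap *h, int a, int b)
{
    int tmp = h->slot[a];
    h->slot[a] = h->slot[b];
    h->slot[b] = tmp;
    /* keep the pagelist back-pointers up to date on every rearrange */
    h->pages[h->slot[a]].heap_pos = a;
    h->pages[h->slot[b]].heap_pos = b;
}

static void heap_up(struct age_heap *h, int i)
{
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->pages[h->slot[parent]].age <= h->pages[h->slot[i]].age)
            break;
        heap_swap(h, parent, i);
        i = parent;
    }
}

static void heap_down(struct age_heap *h, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, min = i;
        if (l < h->n && h->pages[h->slot[l]].age < h->pages[h->slot[min]].age)
            min = l;
        if (r < h->n && h->pages[h->slot[r]].age < h->pages[h->slot[min]].age)
            min = r;
        if (min == i)
            break;
        heap_swap(h, min, i);
        i = min;
    }
}

/* O(log n) insert of pagelist index 'idx'. */
static void heap_insert(struct age_heap *h, int idx)
{
    h->slot[h->n] = idx;
    h->pages[idx].heap_pos = h->n;
    heap_up(h, h->n++);
}

/* Pop the minimum-age (least recently used) block: the root. */
static int heap_pop_lru(struct age_heap *h)
{
    int idx = h->slot[0];
    h->pages[idx].heap_pos = -1;
    h->slot[0] = h->slot[--h->n];
    if (h->n > 0) {
        h->pages[h->slot[0]].heap_pos = 0;
        heap_down(h, 0);
    }
    return idx;
}
```

Flipping the two `age` comparisons turns this into the max-heap needed for an MRU policy, which is why the same structure covers both.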
One example where any type of priority queue fails is when you need one more block than is available in the largest free region. If the largest free region is 54 blocks and 55 blocks are needed, the optimal solution is most likely to reclaim one of the used blocks at the head or tail of the free region.
I think we'd also prefer to reclaim blocks that don't need to be swapped (i.e., have the throw-away bit set).
There are some cases, depending on the number of blocks needed, where we'd also prefer to reclaim blocks that aren't part of a sequence.
We may well settle on a slight variation of simple LRU or MRU. I'd like our initial implementation to allow a little more flexibility to experiment with different heuristics for improving performance.
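For the bit-map side of this, picking the largest free run is a single linear scan; a hypothetical sketch (not the existing DRM allocator), where the caller would then apply the heuristics above to reclaim used blocks at either end of the run:

```c
/* Hypothetical free-map scan: mem[i] != 0 means block i is in use.
 * Returns the start of the largest run of free blocks and stores its
 * length in *len, or returns -1 if nothing is free.  In the 54-free /
 * 55-needed example above, the caller would reclaim one used block
 * adjacent to the run this finds. */
static int largest_free_run(const unsigned char *mem, int nblocks, int *len)
{
    int best_start = -1, best_len = 0;
    int i = 0;

    while (i < nblocks) {
        if (mem[i]) { i++; continue; }
        int start = i;
        while (i < nblocks && !mem[i])
            i++;                     /* walk to the end of this run */
        if (i - start > best_len) {
            best_len = i - start;
            best_start = start;
        }
    }
    *len = best_len;
    return best_start;
}
```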
[another big snip]
>> I got to thinking about this, and I came up with another solution.
>> When process B is going to reclaim blocks from process A, process B
>> marks the reclaimed blocks with its block ID and clears the can-swap
>> bit. It also clears the can-swap bit on other blocks (that it already
>> has) that are part of the sequence. When process B calls into the
>> kernel, it passes in the block number and the old block ID & sequence
>> information. This prevents process A from getting the CPU and trying
>> to take the blocks back.
>
> Up to this point, I follow you. Here is my problem. Say we have 16KB
> blocks, and process B has a single block containing 4 vertex buffers.
> The first 3 are 1KB each and the last one is 2KB (and hangs over into
> the next block). This first block is the one that process A selected
> to swap out. Is the ID number assigned to the first block (the one
> that was swapped out) the same as that of the second block (which
> wasn't swapped out)? It seems like it should be (that would enable the
> kernel to shuffle things around and keep blocks contiguous), but it
> also seems like it would be difficult (or at least irritating) to keep
> all the block IDs correct as subregions of the blocks are allocated
> and freed by a process. When process B goes to sleep waiting for its
> blocks to come back, what locks will it hold? If it doesn't hold any,
> how do we prevent process A from coming back and stealing the second
> block?

We don't need to hold any locks when we sleep: since we have to ask the
kernel for the block back anyway, the kernel can decide whom to wake up
and whom to put to sleep. We can just wake in FIFO order from the wait
queue. We could also have the kernel make a simple decision: if it sees
that there is a lot of contention on one area, it could "recommend"
another area in some fashion for the process to put its textures in.
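The mark-and-pin handshake described above could look roughly like this. It's a hypothetical sketch: `blk`, `reclaim_block`, and `pin_sequence` are invented names, and a real version would need the hardware lock or atomics around the shared-state updates:

```c
#include <stdint.h>

/* Hypothetical shared (SAREA-style) per-block state. */
struct blk {
    uint32_t owner_id;   /* block ID of the owning context */
    uint32_t seq;        /* position within the owner's sequence */
    unsigned can_swap:1; /* clear => must not be stolen right now */
};

/* Process B steals a block from process A: stamp it with B's ID and
 * clear can-swap so A cannot immediately take it back.  The previous
 * owner/seq are returned so B can pass them to the kernel, which then
 * decides whom to wake (FIFO) once the copy completes. */
static uint32_t reclaim_block(struct blk *b, uint32_t new_owner,
                              uint32_t *old_seq)
{
    uint32_t old_owner = b->owner_id;

    *old_seq = b->seq;
    b->owner_id = new_owner;
    b->can_swap = 0;
    return old_owner;
}

/* B also clears can-swap on the blocks it already holds that belong to
 * the same sequence, so the whole unit stays put while it is in use. */
static void pin_sequence(struct blk *blocks, int n, uint32_t owner)
{
    for (int i = 0; i < n; i++)
        if (blocks[i].owner_id == owner)
            blocks[i].can_swap = 0;
}
```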
[snip]
>> How would we dole out IDs? Would there be a kernel call to get /
>> release a set of IDs?
>
> Probably, however there are methods to do this without kernel
> intervention. The good ole' bit vector is great for doling out keys. ;)
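A userspace bit-vector ID allocator is only a few lines; this sketch is single-process only (a shared version would need atomic test-and-set on the words), and the names are illustrative:

```c
#include <stdint.h>

#define MAX_IDS 128

/* One bit per ID; bit set means the ID is handed out. */
static uint32_t id_map[MAX_IDS / 32];

/* Find-first-clear, set it, and return it; -1 if all IDs are taken. */
static int alloc_id(void)
{
    for (int i = 0; i < MAX_IDS; i++) {
        if (!(id_map[i / 32] & (1u << (i % 32)))) {
            id_map[i / 32] |= 1u << (i % 32);
            return i;
        }
    }
    return -1;
}

static void free_id(int id)
{
    id_map[id / 32] &= ~(1u << (id % 32));
}
```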
Oh. Duh. :)

_______________________________________________
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel