> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Ian Romanick
> Sent: Friday, January 17, 2003 2:10 PM
> To: DRI developer's list
> Subject: Re: [Dri-devel] The next round of texture memory management...
>
>
> Jeff Hartmann wrote:
>
> >>That may not be possible.  Right now the blocks are tracked in the
> >>SAREA, and that puts an upper limit on the number of blocks available.
> >>On a 64MB memory region, the current memory manager ends up with 64KB
> >>blocks, IIRC.  As memories get bigger (both on-card and AGP apertures),
> >>the blocks will get bigger.  Also, right now each block only requires
> >>4 bytes in the SAREA.  Any changes made for a new memory manager would
> >>make each block require more space, thereby reducing the number of
> >>blocks that could fit in the SAREA.
> >>
> >>Even if we increase the size of the SAREA, a system with 128MB of
> >>on-card memory and a 128MB AGP aperture would require ~65000 blocks
> >>(if each block covered 4KB).
> >
> > Don't worry too much about this; we can create an entirely new SAREA
> > to hold the memory manager.  It can also be rather large: I'm thinking
> > 128KB or so wouldn't be a problem at all.  This will be non-swappable
> > memory, but that's not too big a deal.  Here is what I'm thinking of
> > as the general block format right now; it might not be perfect:
>
> That works.  It should also be possible to have it vary its size
> depending on the amount of memory to be managed.

Yeah, that shouldn't be too difficult to accomplish.
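Just to sketch the sizing arithmetic (the 8-byte descriptor comes from
the block format below; the block size is whatever we end up picking, so
treat this as illustrative only):

/*
 * Illustrative only: size the memory-manager SAREA from the amount of
 * memory to be managed.  Assumes one 8-byte descriptor per block; the
 * block size is a tunable, not a fixed part of the design.
 */
#include <stddef.h>

#define BLOCK_DESC_BYTES 8      /* two u32s per block, per the format below */

static size_t mm_sarea_bytes(size_t managed_bytes, size_t block_bytes)
{
        size_t nblocks = (managed_bytes + block_bytes - 1) / block_bytes;

        /* e.g. 256MB managed in 4KB blocks -> ~65000 blocks -> ~512KB */
        return nblocks * BLOCK_DESC_BYTES;
}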
> > [code segment snipped]
> >
> > struct memory_block {
> >     u32 age_variable;
> >     u32 status;
> > };
> >
> > Where the age variable is device dependent, but I would imagine in
> > most cases is a monotonically increasing unsigned 32-bit number.
> > There needs to be a device driver function to check if an age has
> > happened on the hardware.
>
> I don't think having an age variable in the shared area is necessary or
> sufficient.  That's what my original can-swap bit was all about.  Each
> item that is in a block would have its own age variable / fence.  When
> all of the age variable / fence conditions were satisfied, the can-swap
> bit would be set.

Actually, I think it is the best way: all you do is put the "greatest" or
"latest" age variable in the block description.  That way we are only
done when the last thing is fenced.  It makes swap decisions a HELL of a
lot easier, since we don't have to have any nasty signal code and age
lists all over the place.

> > The status variable has some room; only the bottom 28 bits are defined
> > at the moment.  The first 4 bits are status bits.  If BLOCK_CAN_SWAP
> > is set, we can swap this block; swapping requires the driver to call
> > the kernel to swap out this block using some AGP method where the
> > contents are preserved.  This can be accomplished by card DMA.  If
> > BLOCK_LINKS_TO_NEXT is set, we are part of a group of blocks, which
> > must be treated as a unit.  If BLOCK_CAN_BE_CLOBBERED is set, the
> > driver can just overwrite this block of memory.  If BLOCK_IS_CACHABLE
> > is set, we can read back from this block in a fast way, so fallbacks
> > can directly use this block.
>
> That's interesting.  I hadn't considered having kernel intervention to
> actually page out blocks.  I had always been under the assumption that
> all blocks in AGP or on-card memory were either locked or throw-away.

Yeah, that's a big important thing here: having some of the operations
happen in the kernel allows you to do some really nice things.  My main
concern is having the logic outside of the kernel; the kernel does some
things better than anyone else, and can do things other people can't.
As long as the kernel doesn't have to make the decisions and keep around
enough information to make the proper decisions, I'm happy with the
implementation.
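For reference, here is roughly how I picture the status word packing
together: the 4 flag bits at the bottom, the log2 usage field next (its
exact position is just a guess on my part), and the 20-bit block id in
bits 27:8 as described further down.  A sketch, not a final layout:

/*
 * Sketch of the status word, not a final layout.  Only the bottom 28
 * bits are defined; the position of the usage field is a guess at how
 * the pieces could pack together.
 */
#include <stdint.h>

#define BLOCK_CAN_SWAP          (1u << 0)  /* kernel may swap this block out */
#define BLOCK_LINKS_TO_NEXT     (1u << 1)  /* part of a multi-block unit */
#define BLOCK_CAN_BE_CLOBBERED  (1u << 2)  /* contents may be overwritten */
#define BLOCK_IS_CACHABLE       (1u << 3)  /* fast readback, usable by fallbacks */

/* log2(usage) - 1 packed in 4 bits: usages from 2 bytes up to 64KB */
#define BLOCK_USAGE_SHIFT       4
#define BLOCK_USAGE_MASK        (0xfu << BLOCK_USAGE_SHIFT)

/* 20-bit block id in bits 27:8; bits 31:28 are still undefined */
#define BLOCK_ID_SHIFT          8
#define BLOCK_ID_MASK           (0xfffffu << BLOCK_ID_SHIFT)

static inline uint32_t block_id(uint32_t status)
{
        return (status & BLOCK_ID_MASK) >> BLOCK_ID_SHIFT;
}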
> Just like with regular virtual memory, I think we only need to "page
> out" pages that we're going to use.  I don't think we should need to
> page out an entire set of linked pages.  Initially we may want to,
> though.  It wouldn't help much with on-card memory, but with AGP memory
> (where we can change mappings), we should be able to do some tricks to
> avoid having to do full re-loads.  It's also possible that only a
> subset of the blocks belonging to an object will have been modified.
>
> Perhaps what we really need to know for each block is:
>
> 1. Is the block modified (i.e., by glCopyTexImage)?
> 2. What pages in system memory back the block?  That is, where are the
> parts of the texture in system memory that represent the block in AGP /
> on-card memory?

Now this is too much information, I think.  We may want to store which
AGP key references blocks in some sort of separate way, but I don't know
how useful that information would be... I have to do some thinking here.
Here are my thoughts about things:

1. We are a particular page inside a particular address space.  We only
know we are page #n in that address space.  We don't care about anything
else; our page number is our offset.  We would have a card pool and an
AGP pool.  We also have a swapped-out pool, but things there probably
can't be directly accessed; we need kernel intervention to allow us to
access those swapped-out things.  If the address space is segmented into
several mappings, we need an address-mapping function in the client-side
3D driver.  Not terribly difficult, and we don't have to store too much
information.

2. We consider the block or group of blocks as an entire "unit";
everything is done on units, not individual pieces of the blocks.  That
prevents someone swapping out the first page of a group of textures and
someone else having to wait for just that block to come back.

3. Only large AGP allocations are swapped out at one time.  Little
blocks are blitted into a 1MB region; when it is full and the blits are
committed, we can decide to swap them out of the AGP aperture.  This
avoids lots of small pages being swapped out, thrashing with lots of AGP
GATT table accesses and potentially causing nasty things like excessive
CPU cache flushes.

4. Implementations without AGP will require at least PCI DMA, or a slow
function that copies over the bus.  They will have only one pool, and
will be considered "swapped" when they aren't in the card pool.  If the
card supports some sort of PCI GART we could treat it similarly to AGP
memory.

5. It might be useful to know some metrics about what kind of memory we
are: backbuffer, texture, etc.  I'm not sure if we really need to know
this information, but it could be useful.

> Hmm...starts to feel like a regular virtual memory system...

Hehe, that's about the long and the short of it...

> > The BLOCK_LOG2 stuff is a way to pack the usage of this block of
> > memory in just a few bits.  We pack log2 - 1, where we only accept
> > usages of 2 bytes or more.  Using 2 bytes could be considered empty.
> > We can store block usage sizes of up to 64KB in this manner.  I think
> > that we want 64KB to be our maximum size for a block.
>
> That's probably finer granularity than we need.  We could probably get
> away with "empty", "mostly empty", "half full", "mostly full", and
> "full".  Admittedly, that only saves one bit, but it removes the
> 64KB limit.

Sounds okay.

> One thing this is missing is some way to prioritize which blocks are to
> be swapped out.  Right now the blocks are stored in a LRU linked list,
> but I don't think that's necessarily the best way (the explicit linked
> list) to go.

Selection might happen LRU or not.  The reason for making the age
variable public, though, is so we could perhaps weight selection using
it.  We could also weight decisions on memory type if we encode that
information.  Keeping memory-type information around also allows us to
make private->shared backbuffer / depthbuffer decisions more easily and
without as much (or perhaps any) client intervention.

There needs to be a selection of which pages to grab next if going in a
linear fashion in a client's address space fails.  Perhaps it jumps by a
preset limit (the normal address space each client carves out for
itself) and tries again.  Perhaps if something like that fails we fall
back to linear age-based scanning.  Perhaps we keep some sort of
freelist based on regions.  We can encode a region of 256 megs by page
offset and number of 4K pages in a single 32-bit number; change the page
size a little bit and the number of bits required becomes smaller.
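A minimal sketch of that packing (the 16/16 split just falls out of
256MB / 4KB = 65536 pages; treat it as illustrative, since the real
split depends on the page size we settle on):

/*
 * Illustrative only: pack a region of a 256MB space into one 32-bit
 * word as (starting 4KB page, number of 4KB pages).  65536 pages fit in
 * 16 bits, so each field gets half the word.  A count of 0 could be
 * read as "65536 pages" if we ever need to describe the whole space.
 */
#include <stdint.h>

static inline uint32_t region_pack(uint32_t first_page, uint32_t num_pages)
{
        return (first_page << 16) | (num_pages & 0xffff);
}

static inline uint32_t region_first_page(uint32_t region)
{
        return region >> 16;
}

static inline uint32_t region_num_pages(uint32_t region)
{
        return region & 0xffff;
}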
I'm still thinking at this point and don't have the perfect data
structure and logic worked out just yet for the freelist/usedlist part
of the memory manager.  I suppose, though, that's what these technical
emails are for.  Originally at VA I thought to just use something like
the Utah-GLX memory manager for the freelist, or perhaps Keith's block
memory manager, but just extend it.  I'm not so sure that is the proper
solution, though.  I guess writing down some of the attributes we want
is in order, and trying to think through the problem:

1. We want a method to find out if our pages were messed with, and this
should be fast and trivial.  Hashing into a bit vector based on id seems
like a good thing here.  I describe this in detail later in the email
when I answer one of your questions.

2. We want a fast method to find a free region if any exist.  A free
region should be randomly selected, or selected with a weight towards
the page(s) being close to an address range we specify.  A queue, stack
or list come to mind here.  Lists have such poor performance sometimes
though...  Should the lists be stored inside the data blocks?  I don't
think so, but it might be an implementation that makes sense.  I tend to
think the "allocation" structures could/should be separate from the
pagelist.  Here are some possible freelist implementations:

a. Each page is a bit in a bit vector.  A set bit means the memory is in
use.  Find the first zero bit and look for a run of zero bits of a
certain length afterwards for a particular sized region (a rough sketch
of this search follows after the list).  This would have really good
performance in the normal case where we are not over-committed, and
could be made to index to a particular address space easily.  There are
drawbacks here though: we could potentially use a lot of memory with all
the bit vectors in our memory manager.  Also, this doesn't help us too
much in the over-committed case; we would need something to run through
the pagelist and swap out by age in a linear fashion when things get
over-committed.  Or perhaps in a random fashion, dunno.  This could
happen in a kernel thread or in the X server at regular intervals
though, so we always attempt to have some room in the freelist.

b. We go with a list, queue, or stack.  6 bytes for 256 megs / 4KB pages
for a single link, or 8 bytes for 256 megs / 4KB pages for a double
link.  We make the head of the list at initialization point to the whole
region and we split, much like the Utah implementation did.  Could be
LRU or MRU.  Might be faster than the previous method unless we want to
weight by address space, in which case it might get more complicated.
Unfortunately this is only kept sorted on age, not on where we are in
the address space.  It could have poor free-region selection
performance, and things might tend to be grouped and have some bad
behavior.  Perhaps we keep two separate lists, the allocated list and
the free list.

c. Something a little more exotic might have better performance.
Perhaps keeping a binary tree as a front end to the region lists,
allowing us to select quickly based on address space.  Perhaps slice up
the address space into big (say 4MB) regions and have a list for each
region.  Perhaps hashing based on region size.  I suppose the
possibilities are endless here.  While something like these ideas might
work, I usually go back to the drawing board when I end up with too
exotic a solution.  Simple and elegant tends to work best in most
situations.

3. We want an easy method to grow the memory backing an AGP pool, but
also some sort of per-client restriction; perhaps just the system-wide
restrictions will do?  This should be solved by the AGP extension
proposal I made earlier in the week.

4. We want a simple way to determine if an age allows us to do something
to a texture which has the BLOCK_CAN_BE_CLOBBERED bit set; storing the
last age used on a block is all that should be required, I think.
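Here is the rough sketch of the option (a) search mentioned above,
assuming one bit per page with set bits meaning "in use".  It's a naive
linear scan just to show the idea; a real version would want to test a
word at a time:

/*
 * Sketch of option (a): each page is one bit, set = in use.  Scan for a
 * run of 'count' clear bits and return the first page of the run, or -1
 * if no run exists (in which case we fall back to swapping by age).
 */
#include <stddef.h>

static long find_free_run(const unsigned char *map, size_t total_pages,
                          size_t count)
{
        size_t run = 0;

        if (!count)
                return -1;

        for (size_t page = 0; page < total_pages; page++) {
                int in_use = (map[page / 8] >> (page % 8)) & 1;

                run = in_use ? 0 : run + 1;
                if (run == count)
                        return (long)(page - count + 1);
        }
        return -1;
}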
> > The bits 27:8 would be a 20-bit number representing a block id.  Each
> > one would be unique, so the driver could keep track of what blocks
> > represent a texture.  A 20-bit number should be sufficient, since
> > that gives us like 2 million values to work with.
> >
> > This is a pretty good start for a block format, I think.  We want to
> > make the memory management SAREA have a lock of its own; it shouldn't
> > be a big deal to extend the DRM to provide us with one.  Or perhaps
> > we use the normal device lock when we do any management, I haven't
> > decided yet.  There are some issues to really think about here.
> >
> > This sort of implementation needs the kernel to be able to swap out a
> > block from AGP memory.  The kernel should reserve a portion of the
> > AGP aperture for this purpose, probably on the order of 2-4 MB.  Each
> > allocation of the AGP aperture should be no smaller than 1MB in size,
> > to prevent agpgart from having to deal with too many blocks of
> > memory.  It will also have to be no smaller than the agp_page_shift,
> > in case someone is using 4MB AGP pages.  The kernel will blit, with a
> > card-specified function, the designated block from its current
> > position to its final position in the block of AGP memory to be
> > swapped.  When the ENTIRE block is full, the kernel will call agpgart
> > to swap that region out of the AGP aperture.  The kernel will keep
> > track of what each swapped-out block contains in some manner, or
> > might brute-force scan the shared memory area containing the
> > swapped-out blocks.
>
> Okay.  There's a few details of this that I'm not seeing.  I'm sure
> they're there, I'm just not seeing them.
>
> Process A needs to allocate some blocks (or even just a single block)
> for a texture.  It scans the list of blocks and finds that not enough
> free blocks are available.  It performs some hokus-pokus and determines
> that a block "owned" by process B needs to be freed.  That block has
> the BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is
> cleared.
>
> Process A asks the kernel to page the block out.  Then what?  How does
> process B find out that its block was stolen and page it back in?

Okay, here is how I think things could happen.  I want to page the block
out, so I ask the kernel to return when the list of pages I give it has
been swapped out and is available.  If the kernel can immediately
process this request, it does so and returns; if it has to do some DMA,
it puts the client on a wait queue to be woken up when it happens.  The
kernel goes ahead and updates the blocks in the SAREA saying that they
aren't there (marking their ids as zero, perhaps).

Process B comes along and sees its textures aren't resident and needs
them, so it asks the kernel to make them resident somewhere; it doesn't
care where.  It passes some ids to the kernel and asks the kernel to
make them resident.  The kernel puts the process on a wait queue or
returns in a similar fashion to the first request.

Whenever we get the lock with contention we must do some sort of quick
scanning.  We might want to speed up the process somehow, perhaps some
sort of hashing by texture number to a dirty flag.  Actually, that is
probably the best implementation.  If we reserve 64KB of address space
to be our dirty flags (backed only when accessed), we can make the dirty
flags a bit vector.  Using the texture or block id as an index into this
vector, we can rapidly find out if our list of textures has been
"fooled" with.  This prevents us from scanning the entire list, which
could be slow.  I should also point out that the ids will be reused; we
will always attempt to use the smallest id available.  This way, using
the id as an index into a shared memory area isn't so bad, and we avoid
using lots of memory for nothing when we only have a few texture blocks.
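A rough sketch of that dirty-flag check, one bit per block id (the
helper names are made up for illustration; whether the kernel or the
client clears the bits is still an open question):

/*
 * Sketch only: dirty flags kept as a shared bit vector indexed by block
 * id.  The kernel would set a block's bit when it swaps or moves the
 * block; a client that grabs the lock with contention tests the bits
 * for the ids it cares about instead of scanning the whole block list.
 */
#include <stdint.h>

static inline void mark_block_dirty(uint8_t *dirty, uint32_t id)
{
        dirty[id / 8] |= (uint8_t)(1u << (id % 8));
}

static inline int block_is_dirty(const uint8_t *dirty, uint32_t id)
{
        return (dirty[id / 8] >> (id % 8)) & 1;
}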
> > There will be a non-backed shared memory area that contains all the
> > swapped-out pages; the swapped pool is probably a good thing to call
> > it.  Basically it's a shared memory area, of say 1MB in size, that
> > doesn't have any pages backing it.  It will have a kernel nopage
> > function that populates it if needed.  Basically it will only have
> > information in it if things are swapped out of the aperture.
> >
> > There needs to be a kernel function which moves a block of memory
> > into cacheable space.  We could do this with PCI DMA, or some magic
> > conversion of unbound AGP pages.  This could be made safe, and
> > wouldn't be a big deal with the new agpgart vm stuff.  That way the
> > block of AGP memory could be accessed by a fallback or some other
> > function that needs to directly read the texture.  Readback from
> > normal AGP memory is horrible, something on the order of 60MB/sec.
>
> The conversion would probably be better.  It would also play nice with
> ARB_vertex_array_objects.

Also, I should point out that on some systems we have the nice ability
to have cached AGP memory.  On those systems we need no conversion, or
perhaps just moving the texture into a cacheable memory block.  On those
systems it might even make sense to have all textures marked cacheable,
but that will take some experimentation.

> Also, how does this all work without AGP?  There still are a fair
> number of PCI cards out there. :)
>
> A lot of this is also very Linux specific.  What can we do to make as
> much of this as possible OS independent?  I don't think our BSD friends
> will be very happy if we leave them in the cold. :)  Linux is most
> people's first priority, but it's not the /only/ priority...

While it is Linux specific, the modifications and improvements I make to
agpgart to make this happen can be ported.  I don't think it will
require too much more than that; the functions that will plug into the
kernel could all be portable, much like the rest of the driver code is
currently.  Some nice additions to agpgart are all that is required to
make this possible, I think.  As for using PCI DMA or simple copying of
card memory to PCI memory, that would probably be directly portable with
little or no effort.

-Jeff