> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Ian Romanick
> Sent: Friday, January 17, 2003 2:10 PM
> To: DRI developer's list
> Subject: Re: [Dri-devel] The next round of texture memory management...
>
>
> Jeff Hartmann wrote:
>
> >>That may not be possible.  Right now the blocks are tracked in the
> >>SAREA, and that puts an upper limit on the number of block available.
> >>On a 64MB memory region, the current memory manager ends up with 64KB
> >>blocks, IIRC.  As memories get bigger (both on-card and AGP apertures),
> >>the blocks will get bigger.  Also right now each block only requires 4
> >>bytes in the SAREA.  Any changes that would be made for a new memory
> >>manager would make each block require more space, thereby reducing the
> >>number of blocks that could fit in the SAREA.
> >>
> >>Even if we increase the size of the SAREA, a system with 128MB of
> >>on-card memory and 128MB AGP aperture would require ~65000 blocks (if
> >>each block covered 4KB).
> >
> >     Don't worry too much about this, we can create an entirely new
> > SAREA to hold the memory manager.  It can also be rather large; I'm
> > thinking 128KB or so wouldn't be a problem at all.  This will be
> > non-swappable memory, but that's not too big a deal.  Here is what I'm
> > thinking of as the general block format right now, it might not be
> > perfect:
>
> That works.  It should also be possible to have it vary its size
> depending on the amount of memory to be managed.

Yeah, that shouldn't be too difficult to accomplish.

>
> [code segment snipped]
>
> > struct memory_block {
> >     u32     age_variable;
> >     u32     status;
> > };
> >
> >     Where the age variable is device dependent, but I would imagine
> > in most cases is a monotonically increasing unsigned 32-bit number.
> > There needs to be a device driver function to check if an age has
> > happened on the hardware.
>
> I don't think having an age variable in the shared area is necessary or
> sufficient.  That's what my original can-swap bit was all about.  Each
> item that is in a block would have its own age variable / fence.  When
> all of the age variable / fence conditions were satisfied, the can-swap
> bit would be set.

Actually I think it is the best way: all you do is put the "greatest" or
"latest" age variable in the block description.  That way we are only
done when the last thing is fenced.  It makes swap decisions a HELL of a
lot easier, since we don't need any nasty signal code and age lists all
over the place.

>
> >     The status variable has some room, only the bottom 28 bits are
> > defined at the moment.  The first 4 bits are some status bits.  If
> > BLOCK_CAN_SWAP is set, we can swap this block; swapping requires the
> > driver to call the kernel to swap out this block using some agp method
> > where the contents are preserved.  Can be accomplished by card DMA.
> > If BLOCK_LINKS_TO_NEXT is set we are part of a group of blocks, which
> > must be treated as a unit.  If BLOCK_CAN_BE_CLOBBERED is set, the
> > driver can just overwrite this block of memory.  If BLOCK_IS_CACHABLE
> > is set we can read back from this block in a fast way, so fallbacks
> > can directly use this block.
>
> That's interesting.  I hadn't considered having kernel intervention to
> actually page out blocks.  I had always been under the assumption that
> all blocks in AGP or on-card memory were either locked or throw-away.

Yeah, that's the big important thing here: having some of the operations
happen in the kernel allows you to do some really nice things.  My main
concern is keeping the decision logic outside of the kernel; the kernel
does some things better than anyone else, and can do things other people
can't.  As long as the kernel doesn't have to make the decisions and
keep around enough information to make the proper decisions, I'm happy
with the implementation.

>
> Just like with regular virtual memory, I think we only need to "page
> out" pages that we're going to use.  I don't think we should need to
> page out an entire set of linked pages.  Initially we may want to,
> though.  It wouldn't help much with on-card memory, but with AGP memory
> (where we can change mappings), we should be able to do some tricks to
> avoid having to do full re-loads.  It's also possible that only a subset
> of the blocks belonging to an object will have been modified.
>
> Perhaps what we really need to know for each block is:
>
> 1. Is the block modified (i.e., by glCopyTexImage)?
> 2. What pages in system memory back the block?  That is, where are the
> parts of the texture in system memory that represent the block in AGP /
> on-card memory?

Now this is too much information, I think.  We may want to store which
agp key references which blocks in some sort of separate way, but I
don't know how useful that information would be...  I have to do some
thinking here.  Here are my thoughts:

1. We are a particular page inside a particular address space.  We only
know we are page #n in that address space; we don't care about anything
else, because our page number is our offset.  We would have a card pool
and an agp pool.  We also have a swapped-out pool, but things here
probably can't be directly accessed; we need kernel intervention to get
at these swapped out things.  If the address space is segmented into
several mappings we need an address mapping function in the client side
3D driver.  Not terribly difficult, and we don't have to store too much
information.

2. We consider the block or group of blocks as an entire "unit";
everything is done on units, not on individual pieces of the blocks.
That prevents someone swapping out the first page of a group of textures
and another client having to wait for just that block to come back.

3. Only large agp allocations are swapped out at one time.  Little
blocks are blitted into a 1MB region; when it is full and the blits are
committed, we can decide to swap them out of the agp aperture.  This
avoids lots of small pages being swapped out, thrashing with lots of agp
GATT table accesses and potentially causing too many nasty things like
CPU cache flushes.

4. Implementations without agp will require at least PCI DMA, or a slow
function that copies over the bus.  They will have only one pool, and will
be considered "swapped" when they aren't in the card pool.  If the card
supports some sort of PCI-GART we could treat it similarly to agp memory.

5. It might be useful to know some metrics about what kind of memory we are,
backbuffer, texture, etc.  I'm not sure if we really need to know this
information, but it could be useful.

>
> Hmm...starts to feel like a regular virtual memory system...

Hehe, that's about the long and the short of it...

>
>  > The BLOCK_LOG2 stuff is a way to pack the usage of this block of
> > memory in just a few bits.  We pack log2 - 1, where we only accept
> > usages of 2 bytes or more.  Using 2 bytes could be considered empty.
> > We can store block usage sizes of up to 64k in this manner.  I think
> > that we want 64kb to be our maximum size for a block.
>
> That's probably finer granularity than we need.  We could probably get
> away with "empty", "mostly empty", "half full", "mostly full", and
> "full".  Admittedly, that only saves one bit, but it removes the
> 64KB limit.

Sounds okay.

>
> One thing this is missing is some way to prioritize which blocks are to
> be swapped out.  Right now the blocks are stored in a LRU linked list,
> but I don't think that's necessarily the best way (the explicit linked
> list) to go.

Selection might happen LRU or not.  The reason for making the age
variable public, though, is so we could perhaps weight decisions using
it.  We could also weight decisions by memory type if we encode that
information.  Keeping memory type information around also lets us make
private->shared backbuffer / depthbuffer decisions more easily, with
less or perhaps no client intervention.

There needs to be a way to select which pages to grab next if going in a
linear fashion in a client's address space fails.  Perhaps it jumps by a
preset limit (the normal address space each client carves out for
itself) and tries again.  Perhaps if something like that fails we fall
back to linear age-based scanning.  Perhaps we keep some sort of
freelist based on regions.  We can encode a region of 256 megs, by page
offset and number of 4k pages, in a single 32-bit number.  Change the
page size a little and the number of bits required becomes smaller.

I'm still thinking at this point, and don't have the perfect data
structure and logic worked out just yet for the freelist/usedlist part
of the memory manager.  I suppose, though, that's what these technical
emails are for.  Originally at VA I thought to just use something like
the Utah-GLX memory manager for the freelist, or perhaps to extend
Keith's block memory manager, but I'm not so sure that's the proper
solution.

I guess writing down some of the attributes we want is in order, and
trying to think through the problem:
1. We want a method to find out if our pages were messed with, and this
should be fast and trivial.  Hashing into a bit vector based on id seems
like a good thing here.  I describe this in detail later in the email
when I answer one of your questions.

2. We want a fast method to find a free region if any exist.  A free
region should be randomly selected, or selected with a weight towards
the page(s) being close to an address range we specify.  A queue, stack
or list come to mind here.  Lists have such poor performance sometimes,
though...  Should the lists be stored inside the data blocks?  I don't
think so, but it might be an implementation that makes sense.  I tend to
think the "allocation" structures could/should be separate from the
pagelist.

Here could be some possible freelist implementations:
a. Each page is a bit in a bit vector.  A set bit means the memory is in
use.  Find the first zero bit, then look for a run of zero bits of the
right length for a particular sized region.  This would have really good
performance in the normal case where we are not overcommitted, and could
be made to index to a particular address space easily.  There are
drawbacks here though: we could potentially use a lot of memory with all
the bit vectors in our memory manager.  Also, this doesn't help us much
in the overcommitted case; we would need something to run through the
pagelist and swap out by age in a linear fashion when things get
overcommitted.  Or perhaps in a random fashion, dunno.  This could
happen in a kernel thread or in the Xserver at regular intervals though,
so we always attempt to have some room in the freelist.

b. We go with a list, queue, or stack: 6 bytes per entry for 256 megs /
4kb pages with a single link, or 8 bytes with a double link.  We make
the head of the list at initialization point to the whole region and we
split much like the Utah implementation did.  Could be LRU or MRU.
Might be faster than the previous method, unless we want to weight by
address space; then it might get more complicated.  Unfortunately this
is only kept sorted on age, not on where we are in the address space.
It could have poor free-region selection performance, and things might
tend to be grouped and have some bad behavior.  Perhaps we keep two
separate lists, the allocated list and the free list.

c. Something a little more exotic might have better performance.
Perhaps keep a binary tree as a front end to the region lists, allowing
us to select quickly based on address space.  Perhaps slice up the
address space into big (say 4 MB) regions and have a list for each
region.  Perhaps hash based on region size.  I suppose the possibilities
are endless here.  While something like these ideas might work, I
usually go back to the drawing board when I end up with too exotic a
solution.  Simple and elegant tends to work best in most situations.

3. We want an easy method to grow the memory backing an agp pool, but
also some sort of per-client restriction; perhaps just the system-wide
restrictions will do?  This should be solved by the agp extension
proposal I made earlier in the week.

4. We want a simple way to determine if an age allows us to do something
to a texture which has the BLOCK_CAN_BE_CLOBBERED bit set; storing the
last age used on a block is all that should be required, I think.

>
> >     The bits 27:8 would be a 20-bit number representing a block id.
> > Each one would be unique, so the driver could keep track of what
> > blocks represent a texture.  A 20-bit number should be sufficient,
> > since that gives us about a million values to work with.
>  >
> >     This is a pretty good start for a block format I think.  We want
> > to make the memory management SAREA have a lock of its own; shouldn't
> > be a big deal to extend the drm to provide us with one.  Or perhaps we
> > use the normal device lock when we do any management, I haven't
> > decided yet.  There are some issues to really think about here.
> >
> >     This sort of implementation needs the kernel to be able to swap
> > out a block from agp memory.  The kernel should reserve a portion of
> > the agp aperture for this purpose, probably on the order of 2-4 MB.
> > Each allocation of the agp aperture should be no smaller than 1MB in
> > size, to prevent agpgart from having to deal with too many blocks of
> > memory.  It will also have to be no smaller than the agp_page_shift,
> > in case someone is using 4MB agp pages.  The kernel will blit, with a
> > card specified function, the designated block from its current
> > position to its final position in the block of agp memory to be
> > swapped.  When the ENTIRE block is full, then the kernel will call
> > agpgart to swap that region out of the agp aperture.  The kernel will
> > keep track of what each swapped out block contains in some manner, or
> > might brute force scan the shared memory area containing the swapped
> > out blocks.
>
> Okay.  There's a few details of this that I'm not seeing.  I'm sure
> they're there, I'm just not seeing them.
>
> Process A needs to allocate some blocks (or even just a single block)
> for a texture.  It scans the list of blocks and finds that not enough
> free blocks are available.  It performs some hokus-pokus and determines
> that a block "owned" by process B needs to be freed.  That block has th
> BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared.
>
> Process A asks the kernel to page the block out.  Then what?  How does
> process B find out that its block was stolen and page it back in?

Okay, here is how I think things could happen:

I want to page the block out, so I ask the kernel to return when the
list of pages I give it has been swapped out and is available.  If the
kernel can process the request immediately, it does so and returns; if
it has to do some dma, it puts the client on a wait queue to be woken up
when the dma happens.

The kernel goes ahead and updates the blocks in the SAREA, saying that
they aren't there (marking their ids as zero, perhaps).

Process B comes along and sees its textures aren't resident and needs
them, so it asks the kernel to make them resident somewhere; it doesn't
care where.  It passes some ids to the kernel and asks the kernel to
make them resident.  The kernel puts the process on a waitqueue or
returns, in a similar fashion to the first request.

Whenever we get the lock with contention we must do some sort of quick
scanning.  We might want to speed up the process somehow, perhaps with
some sort of hashing by texture number to a dirty flag.  Actually, that
is probably the best implementation.  If we reserve 64k of address space
to be our dirty flags (backed only when accessed) we can make the dirty
flags a bit vector.  Treating the texture or block id as an index into
this vector, we can rapidly find out if our list of textures has been
"fooled" with.  This prevents us from scanning the entire list, which
could be slow.

I should also point out that the ids will be reused.  We will always
attempt to use the smallest id available, so using the id as an index
into a shared memory area isn't so bad.  That way we avoid using lots of
memory for nothing when we only have a few texture blocks.

>
> >     There will be a non-backed shared memory area that contains all
> > the swapped out pages; "the swapped pool" is probably a good thing to
> > call it.  Basically it's a shared memory area, of say 1MB in size,
> > that doesn't have any pages backing it.  It will have a kernel nopage
> > function that populates it if needed.  Basically it will only have
> > information in it if things are swapped out of the aperture.
> >
> >     There needs to be a kernel function which moves a block of memory
> > into cacheable space.  We could do this with PCI dma, or some magic
> > conversion of unbound agp pages.  This could be made safe, and
> > wouldn't be a big deal with the new agpgart vm stuff.  That way the
> > block of agp memory could be accessed by a fallback or some other
> > function that needs to directly read the texture.  Readback from
> > normal agp memory is horrible, something on the order of 60MB/sec.
>
> The conversion would probably be better.  It would also play nice with
> ARB_vertex_array_objects.

Also, I should point out that on some systems we have the nice ability
to have cached agp memory.  On those systems we need no conversion, or
perhaps just a move of the texture into a cacheable memory block.  On
those systems it might even make sense to have all textures marked
cacheable, but that will take some experimentation.

>
> Also, how does this all work without AGP?  There still are a fair number
> of PCI cards out there. :)
>
> A lot of this is also very Linux specific.  What can we do to make as
> much of this as possible OS independent?  I don't think our BSD friends
> will be very happy if we leave them in the cold. :)  Linux is most
> people's first priority, but it's not the /only/ priority...

While it is Linux specific, the modifications and improvements I make to
agpgart to make this happen can be ported.  I don't think it will
require much more than that; the functions that will plug into the
kernel could all be portable, much like the rest of the driver code is
currently.  Some nice additions to agpgart are all that is required to
make this possible, I think.

As for using pci dma, or simple copying of card memory to pci memory,
that would probably be directly portable with little or no effort.

-Jeff



_______________________________________________
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel
