Keith Packard wrote:
> While buffer objects provide a huge benefit for long-term objects, their
> benefit for transient data is less clear. Page allocation, cache
> flushing and mmap costs are all significant. This missive is designed to
> outline some of the potential mechanisms to ameliorate or transfer these
> costs to other areas. Please treat it as a suggestion for a list of
> experiments to try, not as a directive for how things should be done.
>
> Let's focus on the batch buffer, as that is a fairly universal object in
> this realm.
>
> In the Intel driver, batch buffers are handled as follows:
>
> 1. Allocate a buffer object
> 2. Map the buffer object to user space
> 3. Write data to the object - faulting in new pages
> 4. Pass the object to the super-ioctl - allocating remaining pages
> 5. Clflush the memory range of the object
> 6. Flush the chipset
> 7. Map the object to the GTT
> 8. When fence passes, free the object
>
> Profiling the Intel driver has shown us that several of these steps are
> expensive:
>
> A. Allocating pages one-at-a-time
> B. Mapping pages to user space
> C. Writing to un-cached buffer object pages
> D. Flushing CPU caches
>
> Two kernel developers suggested to me this week that it would likely be
> cheaper to copy data in the kernel than to use mmap and write the data
> in user space. Re-using the same buffer in user space would mean that
> the stores would be to cached memory (avoiding loading the entire batch
> buffer into the cache). Using non-temporal stores would mean that the
> copy wouldn't load the destination buffer into cache, only to
> immediately flush it back out again with clflush.
>
> They also said that allocating pages one-at-a-time is very inefficient
> and that we should be asking for as many as we need up-front, and then
> falling back to smaller allocations when the larger allocation is not
> available. The kernel groups pages in power-of-two buckets; doing our
> allocations atomically would also tend to reduce memory fragmentation
> within the kernel. This will not increase memory usage as we already
> allocate all of the pages -- there is no benefit to allocating them
> slowly.
>
> Eric already experimented with re-using buffer objects from the driver.
> Buffers would be allocated in a power-of-two size bucket; when a buffer
> was freed, it would be placed on a list of same-sized buffers. At
> allocation time, the driver would check the first element of the
> appropriate list to see if it was idle. If so, that buffer would be
> re-used. Otherwise, a new buffer would be allocated. The results were
> impressive -- greater than 20% performance improvement in openarena.
> However, this pins a huge amount of memory, and still doesn't avoid the
> cache effects from all of the flushing.
>
> So, here's a list of things I think we should try:
>
> 1. Allocate all BO pages at create time.
>
>    This will eliminate the page-fault overhead, clump DRM pages
>    together avoiding memory fragmentation and reduce the cost of
>    allocation. The 965 driver is spending 20-30% of the CPU
>    allocating pages currently.
>
> 2. Add an ioctl for CopyBOSubData.
>
>    Create a buffer in user space to hold batch-buffer contents that
>    is re-used for each new batch. When the batch is full, copy it
>    to the buffer object with this kernel call. Use non-temporal
>    stores to avoid bringing the destination buffer into cache.
>
> 3. Use the GTT for CopyBOSubData
>
>    I wonder if doing the CopyBOSubData through the GTT would be
>    more efficient.
>    It would eliminate the need to flush the chipset, and would also
>    avoid any question of whether non-temporal stores would flush data
>    all the way to the chipset.
>
> This would change buffer management to:
>
> 1. Allocate a buffer object and all of the pages
> 2. Write data to a user-space buffer
> 3. Map the object to the GTT
> 4. Copy data to the buffer object
> 5. Pass the object to the super-ioctl
> 6. When the fence passes, free the object
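As a concrete picture of the reuse scheme Eric experimented with, as
described above (power-of-two buckets, reuse the head of the matching
list only if it is already idle), here is a minimal user-space sketch.
The struct bo, bo_create() and bo_is_idle() names are hypothetical
stand-ins, not the actual driver interfaces.

    #include <stddef.h>

    /* Hypothetical stand-ins for the real buffer-object interfaces. */
    struct bo {
        struct bo *next;
        size_t     size;
    };
    extern struct bo *bo_create(size_t size);   /* fresh allocation */
    extern int bo_is_idle(const struct bo *bo); /* has its fence passed? */

    #define MIN_BO_SIZE 4096                    /* smallest bucket: one page */
    #define NUM_BUCKETS 16

    /* One free list per power-of-two size. */
    static struct bo *bucket[NUM_BUCKETS];

    static int size_to_bucket(size_t size)
    {
        size_t s = MIN_BO_SIZE;
        int b = 0;

        while (s < size && b < NUM_BUCKETS - 1) {
            s <<= 1;
            b++;
        }
        return b;
    }

    /* Allocate: reuse the head of the matching bucket if it is idle,
     * otherwise fall back to creating a fresh buffer of the bucket size. */
    static struct bo *bo_get(size_t size)
    {
        int b = size_to_bucket(size);
        struct bo *head = bucket[b];

        if (head && bo_is_idle(head)) {
            bucket[b] = head->next;
            return head;
        }
        return bo_create((size_t)MIN_BO_SIZE << b);
    }

    /* "Free": park the buffer on its bucket instead of destroying it.
     * This is what keeps a large amount of memory pinned. */
    static void bo_put(struct bo *bo)
    {
        int b = size_to_bucket(bo->size);

        bo->next = bucket[b];
        bucket[b] = bo;
    }

Checking only the head of the list keeps the allocation path trivial;
the price, as noted above, is that every parked buffer keeps its pages
pinned.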
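The non-temporal stores suggested in item 2 could look roughly like the
following user-space sketch, built on the SSE2 streaming-store
intrinsics. It is illustrative only, and assumes a 16-byte-aligned
destination and a transfer size that is a multiple of 16 bytes (batch
buffers are normally page-aligned).

    #include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
    #include <stddef.h>

    /*
     * Streaming copy: the movntdq stores bypass the CPU cache, so the
     * copied data neither evicts other cache lines nor needs a clflush
     * pass over the destination afterwards.
     *
     * Assumes dst is 16-byte aligned and size is a multiple of 16 bytes.
     */
    static void copy_nontemporal(void *dst, const void *src, size_t size)
    {
        __m128i *d = dst;
        const __m128i *s = src;
        size_t i, n = size / sizeof(__m128i);

        for (i = 0; i < n; i++)
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

        /* Make the streamed stores globally visible before the buffer is
         * handed to the GPU. */
        _mm_sfence();
    }

Whether stores like these push data far enough for the chipset to see
without an additional flush is exactly the question item 3 tries to
sidestep by copying through the GTT instead.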
Keith,

The three suggested improvements should probably all provide some
performance benefit when used in the right context. The app "ttmtest"
can probably be used to provide some additional insight into where the
bottlenecks are.

1) Allocating all pages at once: Yes, I think this might improve
performance in some cases. The reason it hasn't been done already is
the added complexity needed to keep track of the different allocation
sizes. One optimization that's already in the pipeline is to page in
more than a single page (for example 16) when we hit a pagefault in
nopfn().

2) Copying buffers in drm: This is to avoid vma creation and
pagefaults, right? Yes, that could be an improvement *if* kmap_atomic
is used to provide the kernel mapping. Doing vmap on a whole buffer is
probably almost as expensive as a user-space mapping, and will waste
precious vmalloc space. User-space buffer mappings aren't really that
expensive, and a second map on the same buffer is essentially a no-op,
unless you are using DRM_BO_FLAG_CACHED_MAPPED.

3) Copying buffers through the GATT: I assume you're referring to
binding the buffer to a pre-mapped region of the GATT and then doing
the copy without setting up a new CPU map? That's certainly possible,
and a good candidate for performing relocations if you can't do
kmap_atomic().

However, if you were to reuse buffers in user space and just use plain
old !DRM_BO_FLAG_CACHED, none of these would be real issues. Buffers
will stay bound to the GTT unless they get evicted, and the user-space
vmas would stay populated. You'd pay a performance price the first
time a buffer is created and when it is destroyed.

Also, if one is prepared to go a step further and use user-space
buffer pools for these things, you're even better off
performance-wise. An old i915tex driver patched for the latest DRM
environment would, for simple apps like gears, average only slightly
above 2 kernel calls per batch buffer, and that's the execbuffer call
itself, a fence unreference and an occasional fence sync. The price
you pay for this is more pinned pages in the buffer pools.

/Thomas
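To illustrate what point 2) above proposes, here is a rough kernel-side
sketch of copying user data into a buffer object page by page with
kmap_atomic(), so no vmap of the whole buffer (and no new user-space
vma) is ever created. The simplified struct drm_bo and its pages[]
array are assumptions, not the real TTM structures, and kmap_atomic()
is shown in its current one-argument form.

    #include <linux/errno.h>
    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/uaccess.h>

    /* Hypothetical, simplified buffer object: just its backing pages. */
    struct drm_bo {
        struct page **pages;
        unsigned long num_pages;
    };

    /*
     * Copy 'size' bytes from user memory into the buffer object starting
     * at byte 'offset', mapping one page at a time with kmap_atomic() so
     * no long-lived kernel mapping (vmap) of the whole buffer is needed.
     */
    static int bo_copy_from_user(struct drm_bo *bo, unsigned long offset,
                                 const char __user *src, unsigned long size)
    {
        while (size) {
            struct page *page = bo->pages[offset >> PAGE_SHIFT];
            unsigned long page_off = offset & ~PAGE_MASK;
            unsigned long bytes = PAGE_SIZE - page_off;
            unsigned long left;
            void *vaddr;

            if (bytes > size)
                bytes = size;

            /* Atomic mapping: cheap, but the copy must not fault. */
            vaddr = kmap_atomic(page);
            left = __copy_from_user_inatomic(vaddr + page_off, src, bytes);
            kunmap_atomic(vaddr);

            if (left) {
                /* The source page wasn't resident; retry with a sleeping
                 * mapping and an ordinary copy_from_user(). */
                vaddr = kmap(page);
                left = copy_from_user(vaddr + page_off, src, bytes);
                kunmap(page);
                if (left)
                    return -EFAULT;
            }

            offset += bytes;
            src += bytes;
            size -= bytes;
        }
        return 0;
    }

The atomic mapping keeps the common case cheap; the fallback path only
exists to handle a source page that happens not to be resident.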
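And point 3) might look roughly like this: the buffer is bound into a
GATT range that the driver mapped once at load time, and the staged
data is then written through that mapping, so no per-buffer CPU mapping
is set up and the writes reach memory through the chipset rather than
the CPU cache. Apart from memcpy_toio() and wmb(), every name here
(scratch_gtt_virt, bind_to_scratch_gtt_range()) is a hypothetical
placeholder.

    #include <linux/io.h>
    #include <linux/types.h>

    /*
     * Hypothetical: a scratch GATT range that the driver mapped once at
     * load time, e.g. with ioremap_wc() over part of the aperture.
     */
    extern void __iomem *scratch_gtt_virt;

    /* Hypothetical: bind the buffer object's pages into that range. */
    struct drm_bo;
    extern int bind_to_scratch_gtt_range(struct drm_bo *bo);

    /*
     * Copy data that has already been staged in kernel memory into a
     * buffer object by writing through the GTT aperture. No per-buffer
     * CPU mapping is created, and since the writes bypass the CPU cache
     * there is nothing to clflush afterwards.
     */
    static int bo_copy_via_gtt(struct drm_bo *bo, const void *staged,
                               size_t size)
    {
        int ret = bind_to_scratch_gtt_range(bo);

        if (ret)
            return ret;

        memcpy_toio(scratch_gtt_virt, staged, size);

        /* Drain write-combining buffers before the GPU is told to read. */
        wmb();
        return 0;
    }

This follows the ordering Keith sketches at the end of his mail: map
the object to the GTT first, then copy, then hand the object to the
super-ioctl.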