[ Mostly specific to Radeon and TTM, but general aspects as well ]

So, I was looking into the performance of the radeon-rewrite r300 driver
and found something funny: most of the time was being spent allocating
the DMA buffers used for vertex data.

Basically, what would happen was:

 - Start of command buffer
 - Some vertex data written - 1 megabyte vertex buffer allocated
 - More vertex data written, maybe 2k of the vertex buffer used

 - Command buffer fills and is flushed, vertex buffer released
 - Some vertex data written - 1 megabyte vertex buffer allocated
 - ...
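
In pseudo-C, the cycle looks roughly like this (the function names are
made up for illustration, not the actual radeon-rewrite entry points):

    for (;;) {
        void *vb = alloc_dma_buffer(1 << 20);  /* the expensive part */

        write_vertices(vb);         /* maybe 2k of the 1M actually used */
        flush_command_buffer();
        release_dma_buffer(vb);     /* 1M buffer gone; next draw pays again */
    }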

It turns out that allocating the DMA buffer is actually pretty
expensive, and that the expense is proportional to the size of the
buffer.

On my system, allocating a 1M buffer was taking about 1.5 milliseconds
(1.7GHz Pentium M).

http://fishsoup.net/images/radeon-dma-buffer-profile.png

shows a profile of approximately where the time is going. (This was
after reducing the buffer size from 1M to 256k; the proportion of
overall time spent in this operation was considerably greater with the
1M DMA buffers, but the relative times are the same.)

I think the portion under set_memory_wc() is likely optimizable by doing
some of the work *once* instead of per-page (flush_all_zero_pkmaps()
ends up being called once per page).
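
To make that concrete, here's a sketch of the difference; set_memory_wc()
is the real interface, while set_pages_array_wc() is hypothetical
shorthand for "do the expensive global parts once for the whole set" -
I haven't checked whether anything like it exists:

    /* today, roughly: a full pass per page, so things like
     * flush_all_zero_pkmaps() run once per page */
    for (i = 0; i < npages; i++)
        set_memory_wc((unsigned long)page_address(pages[i]), 1);

    /* instead: batch the pages and pay the global costs once */
    set_pages_array_wc(pages, npages);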

But drm_clflush_pages() looks hard to optimize further - in essence it
just executes a clflush instruction for each 64-byte cache line of the
allocated memory.
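
For scale, here's a self-contained userspace approximation of that
access pattern - one clflush per 64-byte line over a 1M buffer. This is
not the kernel code, just a way to measure the raw flushing cost on a
given machine:

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define CACHE_LINE 64    /* assuming 64-byte cache lines */

    static void flush_range(char *buf, size_t size)
    {
        size_t i;

        for (i = 0; i < size; i += CACHE_LINE)
            _mm_clflush(buf + i);
        _mm_mfence();   /* let the flushes finish before we stop timing */
    }

    int main(void)
    {
        size_t size = 1 << 20;    /* 1M, like the vertex DMA buffer */
        char *buf = malloc(size);
        struct timespec t0, t1;

        memset(buf, 0, size);     /* fault the pages in first */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        flush_range(buf, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("clflush of %zu bytes: %ld us\n", size,
               (t1.tv_sec - t0.tv_sec) * 1000000L +
               (t1.tv_nsec - t0.tv_nsec) / 1000);
        free(buf);
        return 0;
    }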

Some ideas about possible ways to improve things:

 * Just decrease the size of the DMA buffers. Reducing the size of the
   buffers from 1M to 64k dropped the time allocating buffers from >60%
   of the CPU usage to ~5%.
   
   This will restrict very large single operations. It also increases
   the overhead for apps that use lots of vertices without using VBOs.

 * Don't release the DMA buffer when flushing the command buffer; only
   release it when it fills.

   Not sure of the tradeoffs here. (A rough sketch of what this might
   look like follows the list.)

 * Increase the command buffer size.

   Not sure to what extent this is possible; looking at the code, there
   may be a hard 16k limit for Radeon hardware. Also, you'd have to
   increase it a lot; making it twice as big, say, wouldn't help much.

 * Optimize cache flushing on DMA buffer allocation.

   If we're allocating pages for the DMA buffer that were zeroed
   sufficiently far in the past, we probably don't need to flush them
   out of the processor cache. The same goes if the pages were zeroed
   with uncached writes. There are possibilities here for someone
   sufficiently comfortable with deep kernel memory management voodoo.

   [ There's some possibility that the whole thing is just a simple TTM
     bug and the expensive code wasn't meant to be hit when allocating
     a new buffer in the TT domain ]

 * Only map the relevant part of the DMA buffer into the aperture
  
   The basic way we work now is: we allocate a large buffer, map *all*
   of its pages into the aperture, write to it from the CPU, and then,
   once we are done, read from it with the GPU.

   We could allocate the buffer without mapping it into the aperture,
   write to it from the CPU, truncate it to the actual length used,
   then let the kernel map it into the aperture when executing the
   command buffer.

   This would require an API extension to allow truncating a buffer
   object.
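
In hypothetical code (none of these bo_* entry points exist as written;
bo_truncate() is the proposed extension), the flow would be something
like:

    static void emit_vertex_dma(struct drm_device *dev, struct cmdbuf *cb)
    {
        struct bo *bo;
        void *ptr;
        size_t used;

        /* backing pages only - nothing mapped into the aperture yet */
        bo = bo_alloc_unmapped(dev, 1 << 20);

        ptr  = bo_map_cpu(bo);       /* plain CPU mapping of the pages */
        used = emit_vertices(ptr);   /* write whatever vertex data we have */
        bo_unmap_cpu(bo);

        /* proposed extension: give back the unused tail, so only
         * 'used' bytes need flushing and aperture space */
        bo_truncate(bo, used);

        /* the kernel maps the (now small) buffer into the aperture
         * when it actually executes the command buffer */
        cmdbuf_add_relocation(cb, bo);
    }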
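
And for the earlier idea of keeping the DMA buffer across flushes, the
reuse logic might look something like this (again, all names are
hypothetical):

    struct dma_state {
        struct bo *buf;     /* current vertex DMA buffer */
        size_t     used;    /* bytes handed out so far */
        size_t     size;    /* total buffer size */
    };

    /* Hand out 'bytes' of DMA space; reallocate only when the buffer
     * actually fills, not on every command-buffer flush. */
    static void *dma_reserve(struct dma_state *dma, size_t bytes)
    {
        if (!dma->buf || dma->used + bytes > dma->size) {
            if (dma->buf)
                bo_unreference(dma->buf);    /* it's full, let it go */
            dma->buf  = bo_alloc(dma->size); /* the expensive part */
            dma->used = 0;
        }
        dma->used += bytes;
        return (char *)bo_map(dma->buf) + dma->used - bytes;
    }

The obvious wrinkle is that the GPU may still be reading the buffer
from the previous flush while the CPU writes new data into it, so you'd
need fencing or per-range busy tracking.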

- Owen


