On Sun, 2009-05-24 at 11:44 -0400, Owen Taylor wrote:
> [ Mostly specific to Radeon and TTM, but general aspects as well ]
>
> So, I was looking into the performance of the radeon-rewrite r300 driver
> and found something funny: most of the time was being spent allocating
> the DMA buffers used for vertex data.
>
> Basically, what would happen was:
>
>  - Start of command buffer
>  - Some vertex data written - 1 megabyte vertex buffer allocated
>  - More vertex data written, maybe 2k of the vertex buffer used
>
>  - Command buffer fills and is flushed, vertex buffer released
>  - Some vertex data written - 1 megabyte vertex buffer allocated
>  - ...
>
> It turns out that allocating the DMA buffer is actually pretty
> expensive, and that the expense is proportional to the size of the
> buffer.
>
> On my system, allocating a 1M buffer was taking about 1.5 milliseconds
> (1.7GHz Pentium M).
>
> http://fishsoup.net/images/radeon-dma-buffer-profile.png
>
> shows a profile of approximately where the time is going. (This was
> taken after reducing the buffer size from 1M to 256k - the proportion
> of overall time spent in this operation was considerably greater with
> the 1M DMA buffers, but the relative times are the same.)
>
> I think the portion under set_memory_wc() is likely optimizable by
> doing some of the work *once* instead of per-page (flush_all_zero_pkmaps
> ends up being called once per page).
>
> But drm_clflush_pages() looks hard to optimize further - in essence it
> just does a clflush instruction for each 64-byte chunk of the allocated
> memory.
>
> Some ideas about possible ways to improve things:
>
>  * Just decrease the size of the DMA buffers. Reducing the size of the
>    buffers from 1M to 64k dropped the time spent allocating buffers
>    from >60% of CPU usage to ~5%.
>
>    This will restrict very large single operations. It also increases
>    the overhead for apps that use lots of vertices without using VBOs.
>
>  * Don't release the DMA buffer when flushing the command buffer, but
>    only when it fills.
>
>    Not sure of the tradeoffs here.
>
>  * Increase the command buffer size.
>
>    Not sure to what extent this is possible. Looking at the code, there
>    may be a hard 16k limit for Radeon hardware. Also, you'd have to
>    increase it a lot; making it twice as big, say, wouldn't help much.
>
>  * Optimize cache flushing on DMA buffer allocation.
>
>    If we're allocating pages for the DMA buffer that were zeroed
>    sufficiently far in the past, we probably don't need to flush them
>    out of the processor cache. The same holds if the pages were zeroed
>    with uncached writes. Possibilities here for someone sufficiently
>    comfortable with deep kernel memory management voodoo.
>
>    [ Some possibility that the whole thing is just a simple TTM bug
>      and the expensive code wasn't meant to be hit when allocating a
>      new buffer in the TT domain ]
>
>  * Only map the relevant part of the DMA buffer into the aperture.
>
>    The way we work now is that we allocate a large buffer, map *all*
>    of its pages into the aperture, write to it from the CPU, and then,
>    once we are done, read from it with the GPU.
>
>    We could instead allocate the buffer without mapping it into the
>    aperture, write to it from the CPU, truncate it to the actual length
>    used, and then let the kernel map it into the aperture when executing
>    the command buffer.
>
>    This would require an API extension to allow truncating a buffer
>    object.
>
> - Owen
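For reference, the drm_clflush_pages() part really is irreducible on the
CPU side: on x86 it boils down to one clflush per 64-byte line plus a
fence, roughly like the sketch below (written from memory as an
illustration, not the exact drm_cache.c code; it assumes a 64-byte cache
line):

    #include <stddef.h>

    /* Flush every CPU cache line overlapping [start, start + length).
     * This is essentially all drm_clflush_pages() does, one page at a
     * time, so the cost is inherently proportional to buffer size. */
    static void clflush_range(void *start, size_t length)
    {
            const size_t cacheline = 64;    /* assumed line size */
            char *p = (char *)((unsigned long)start & ~(cacheline - 1));
            char *end = (char *)start + length;

            for (; p < end; p += cacheline)
                    asm volatile("clflush %0" : "+m" (*p));

            /* clflush is weakly ordered; fence before the GPU reads. */
            asm volatile("mfence" ::: "memory");
    }

So any win there has to come from flushing fewer pages, or from not
needing the flush at all - which is what the page pool below is about.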
Thanks for looking into this. Since you did your testing with rawhide
(the F11 kernel), it should be using Dave's page allocator, which keeps
a pool of uc pages precisely to avoid routinely calling set_memory_uc -
so it shouldn't be calling set_memory_uc that much. Maybe something is
wrong in the path that is supposed to use this allocator.

Anyway, I think the plan for newttm is to use such a page allocator so
we can avoid changing the cache status of pages on every allocation. We
would also like to avoid zeroing pages more than necessary; for that we
need userspace to keep the buffer around. So in the end the call to
drm_ttm_set_caching (or its equivalent in newttm) should happen only
once, at buffer creation.

Btw, the limit on Radeon for a vertex buffer is quite high:
max number of vertices * max stride = 65536 * (128 * 4 bytes) = 32M.
But I think a stride of 128 dwords is unlikely; the common case would
be around 32 dwords.

Cheers,
Jerome
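P.S. To make the pool idea concrete, here is a rough sketch of what I
mean by a pool of uncached pages - the names and locking are made up
for illustration, this is not the actual code of Dave's allocator. The
point is that set_memory_uc() is paid once per page when it first
enters the pool, instead of on every buffer allocation, and pages
handed out from the pool need no flushing at all:

    #include <linux/gfp.h>
    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/spinlock.h>
    #include <asm/cacheflush.h>    /* set_memory_uc() */

    struct uc_page_pool {
            struct list_head free_pages;    /* already set uncached */
            spinlock_t lock;
    };

    static struct page *uc_pool_get(struct uc_page_pool *pool)
    {
            struct page *p = NULL;

            spin_lock(&pool->lock);
            if (!list_empty(&pool->free_pages)) {
                    p = list_first_entry(&pool->free_pages,
                                         struct page, lru);
                    list_del(&p->lru);
            }
            spin_unlock(&pool->lock);
            if (p)
                    return p;    /* already uncached, nothing to flush */

            /* Pool empty: pay the expensive attribute change once here. */
            p = alloc_page(GFP_KERNEL);
            if (p)
                    set_memory_uc((unsigned long)page_address(p), 1);
            return p;
    }

    static void uc_pool_put(struct uc_page_pool *pool, struct page *p)
    {
            /* Keep the page uncached and stash it for the next
             * allocation instead of flipping it back to write-back. */
            spin_lock(&pool->lock);
            list_add(&p->lru, &pool->free_pages);
            spin_unlock(&pool->lock);
    }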
