[ Mostly specific to Radeon and TTM, but general aspects as well ]

So, I was looking into the performance of the radeon-rewrite r300 driver and found something funny: most of the time was being spent allocating the DMA buffers used for vertex data.
Basically, what would happen was:

 - Start of command buffer
 - Some vertex data written
 - 1 megabyte vertex buffer allocated
 - More vertex data written, maybe 2k of the vertex buffer used
 - Command buffer fills and is flushed, vertex buffer released
 - Some vertex data written
 - 1 megabyte vertex buffer allocated
 - ...

It turns out that allocating the DMA buffer is actually pretty expensive, and that the expense is proportional to the size of the buffer. On my system (a 1.7GHz Pentium M), allocating a 1M buffer was taking about 1.5 milliseconds.

http://fishsoup.net/images/radeon-dma-buffer-profile.png

shows a profile of approximately where the time is going. (This was taken after reducing the buffer size from 1M to 256k - the proportion of overall time spent in this operation was considerably greater with the 1M DMA buffers, but the relative times are the same.)

I think the portion under set_memory_wc() is likely optimizable by doing some of the work *once* instead of per-page. (flush_all_zero_pkmaps() ends up being called once per page.) But drm_clflush_pages() looks hard to optimize further - in essence it just does a clflush instruction for each 64-byte chunk of the allocated memory. (The first sketch at the end of this mail shows roughly what that loop amounts to.)

Some ideas about possible ways to improve things:

* Just decrease the size of the DMA buffers. Reducing the size of the buffers from 1M to 64k dropped the time spent allocating buffers from >60% of CPU usage to ~5%. This will restrict very large single operations, and it also increases the overhead for apps that use lots of vertices without using VBOs.

* Don't release the DMA buffer when flushing the command buffer, but only when it actually fills. Not sure of the tradeoffs here. (The second sketch below shows the bookkeeping this implies.)

* Increase the command buffer size. Not sure to what extent this is possible - looking at the code, there may be a hard 16k limit for Radeon hardware. Also, you'd have to increase it a lot; making it twice as big, say, wouldn't help much.

* Optimize cache flushing on DMA buffer allocation. If we're allocating pages for the DMA buffer that were zeroed sufficiently far in the past, or that were zeroed with uncached writes, we probably don't need to flush them out of the processor cache. Possibilities here for someone sufficiently comfortable with deep kernel memory management voodoo. (Third sketch below.)

  [ Some possibility that the whole thing is just a simple TTM bug and the expensive code wasn't meant to be hit when allocating a new buffer in the TT domain ]

* Only map the relevant part of the DMA buffer into the aperture. The way we work now is: allocate a large buffer, map *all* of its pages into the aperture, write to it from the CPU, then, once we are done, read from it with the GPU. We could instead allocate the buffer without mapping it into the aperture, write to it from the CPU, truncate it to the actual length used, and then let the kernel map it into the aperture when executing the command buffer. This would require an API extension to allow truncating a buffer object. (Last sketch below.)
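For reference, here is a minimal sketch of the kind of loop drm_clflush_pages() boils down to - simplified and written as user-level C, not the actual kernel source:

    #include <stddef.h>

    #define CACHE_LINE_SIZE 64  /* clflush granularity */

    /*
     * One clflush per 64-byte cache line of the buffer.  For a 1M
     * buffer that is 16384 flushes per allocation, which is where
     * the time goes.
     */
    static void clflush_range(void *addr, size_t size)
    {
        char *p = addr;
        char *end = p + size;

        for (; p < end; p += CACHE_LINE_SIZE)
            __asm__ volatile ("clflush (%0)" :: "r" (p) : "memory");

        /* clflush is only ordered by mfence, not by ordinary stores */
        __asm__ volatile ("mfence" ::: "memory");
    }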
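Second, a minimal sketch of the bookkeeping implied by keeping the DMA buffer across command-buffer flushes. The struct and the bo_* functions are hypothetical stand-ins for the winsys buffer-object interface, not the actual radeon-rewrite code; the open tradeoff is that the GPU may still be reading earlier parts of the buffer while the CPU writes new data into it:

    #include <stddef.h>

    /* Hypothetical buffer-object interface standing in for the real
     * winsys calls; these names are illustrative only. */
    struct bo;
    struct bo *bo_alloc(size_t size);
    void      *bo_map(struct bo *bo);
    void       bo_unreference(struct bo *bo);

    struct dma_state {
        struct bo *dma_bo;    /* current DMA buffer, NULL if none */
        size_t     dma_size;  /* total size of dma_bo */
        size_t     dma_used;  /* bytes handed out so far */
    };

    /* Hand out space from the current DMA buffer, allocating a new
     * one only when the old one can't satisfy the request.  The key
     * change: a command-buffer flush no longer resets dma_bo. */
    static void *dma_alloc(struct dma_state *s, size_t size)
    {
        if (!s->dma_bo || s->dma_used + size > s->dma_size) {
            if (s->dma_bo)
                bo_unreference(s->dma_bo);  /* GPU may still hold a ref */
            s->dma_bo = bo_alloc(s->dma_size);  /* the expensive path */
            s->dma_used = 0;
        }

        void *ptr = (char *)bo_map(s->dma_bo) + s->dma_used;
        s->dma_used += size;
        return ptr;
    }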
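Third, a sketch of the skip-the-flush idea, assuming we could track per page whether it can still have dirty cache lines; the dma_page struct and needs_flush flag are invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    void clflush_range(void *addr, size_t size); /* the loop sketched above */

    /* Hypothetical per-page metadata: if a page provably has no dirty
     * cache lines (e.g. it was zeroed with uncached/non-temporal
     * writes, or long enough ago to have been evicted), the flush on
     * DMA-buffer allocation could be skipped for it. */
    struct dma_page {
        void *addr;
        bool  needs_flush;
    };

    static void flush_pages_for_gpu(struct dma_page *pages, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!pages[i].needs_flush)
                continue;                     /* known-clean page */
            clflush_range(pages[i].addr, PAGE_SIZE);
            pages[i].needs_flush = false;
        }
    }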
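Finally, a sketch of what the truncation extension might look like from user space; the ioctl name, number, and struct are invented for illustration:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm.h>   /* libdrm: provides DRM_IOWR() */

    /*
     * Hypothetical API extension: after the CPU is done writing, user
     * space shrinks the buffer object to the bytes actually used, so
     * the kernel only maps (and flushes) those pages at execution time.
     */
    struct drm_bo_truncate {
        uint32_t handle;    /* buffer object to shrink */
        uint32_t pad;
        uint64_t new_size;  /* bytes actually written by the CPU */
    };

    #define DRM_IOCTL_BO_TRUNCATE  DRM_IOWR(0x50, struct drm_bo_truncate)

    /* Intended flow:
     *   1. allocate the 1M buffer without binding it into the aperture
     *   2. write vertex data from the CPU
     *   3. ioctl(fd, DRM_IOCTL_BO_TRUNCATE, &args) with the used length
     *   4. submit the command buffer; the kernel binds only what's left
     */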
- Owen