On Sat, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <[email protected]> wrote:
> > On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
> >> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <[email protected]> wrote:
> >> > On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <[email protected]> wrote:
> >> >> The problem was that we set VRAM|GTT for relocations of STATIC resources.
> >> >> Setting just VRAM increases the framerate 4 times on my machine.
> >> >>
> >> >> I rewrote the switch statement and adjusted the domains for window
> >> >> framebuffers too.
> >> >
> >> > Reviewed-by: Alex Deucher <[email protected]>
> >> >
> >> > Stable branches?
> >>
> >> Yes, good idea.
> >>
> >> Marek
> >
> > Btw, as a follow-up on this, I did some experiments with TTM and eviction.
> > Blocking any VRAM eviction improves the average fps (20-30%) and the
> > minimum fps (40-60%), but it diminishes the maximum fps (100%). Overall,
> > blocking eviction just makes the framerate more consistent.
> >
> > I then tried several heuristics in the eviction process (not evicting a
> > buffer if it was used in the last 1ms, 10ms, 20ms, ..., sorting the LRU
> > differently between buffers used for rendering and auxiliary buffers used
> > by the kernel, ...); none of those heuristics improved anything. I also
> > removed the bo wait in the eviction pipeline, but still no improvement.
> > I haven't had time to look further, but the bottom line is that some
> > benchmarks are memory-tight and constant eviction hurts.
> >
> > (I used Unigine Heaven and Reaction Quake for benchmarking.)
>
> I've come up with the following solution, which I think would help
> improve the situation a lot.
>
> We should prepare a list of command streams and one list of
> relocations for an entire frame, do buffer validation/placements for
> the entire frame at the beginning and then just render the whole frame
> (schedule all the command streams at once).
> That would minimize the buffer evictions and give us the ideal buffer
> placements for the whole frame, and the GPU would then run the commands
> uninterrupted by other processes (and we wouldn't have to flush caches
> so much).
>
> The only downsides are:
> - Buffers would be marked as "busy" for the entire frame, because the
> fence would only be at the end of the frame. We definitely need more
> fine-grained distribution of fences for apps which map buffers during
> rendering. One possible solution is to let userspace emit fences by
> itself and associate the fences with the buffers in the relocation
> list. The bo-wait mechanism would then use the fence from the (buffer,
> fence) pair, while TTM would use the end-of-frame fence (we can't
> trust userspace to give us the right fences).
> - We should find out how to offload flushing and SwapBuffers to
> another thread, because the final CS ioctl will be really big.
> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
> the SwapBuffers call.
- Deferring everything to a single big flush like that might introduce
additional latency before the GPU starts processing a frame, and hurt
some apps.

> Possible improvement:
> - The userspace should emit commands into a GPU buffer and not into
> user memory, so that we don't have to do copy_from_user in the kernel.
> I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
> as well as putting the buffer in the relocation list.

Unmapping etc. shouldn't be necessary in the long run with GPUVM.


-- 
Earthling Michel Dänzer            | http://www.amd.com
Libre software enthusiast          | Debian, X and DRI developer

_______________________________________________
mesa-dev mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
