On Mon, Nov 12, 2012 at 12:23 PM, Christian König <[email protected]> wrote:
> On 12.11.2012 11:08, Michel Dänzer wrote:
>>
>> On Sat, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
>>>
>>> On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <[email protected]> wrote:
>>>>
>>>> On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
>>>>>
>>>>> On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <[email protected]> wrote:
>>>>>>>
>>>>>>> The problem was that we set VRAM|GTT for relocations of STATIC
>>>>>>> resources. Setting just VRAM increases the framerate 4 times on
>>>>>>> my machine.
>>>>>>>
>>>>>>> I rewrote the switch statement and adjusted the domains for window
>>>>>>> framebuffers too.
>>>>>>
>>>>>> Reviewed-by: Alex Deucher <[email protected]>
>>>>>>
>>>>>> Stable branches?
>>>>>
>>>>> Yes, good idea.
>>>>>
>>>>> Marek
>>>>
>>>> Btw, as a follow-up on this, I did some experiments with TTM and
>>>> eviction. Blocking any VRAM eviction improves average fps (20-30%)
>>>> and minimum fps (40-60%) but diminishes maximum fps (100%). Overall,
>>>> blocking eviction just makes the framerate more consistent.
>>>>
>>>> I then tried several heuristics in the eviction process (not evicting
>>>> a buffer if it was used in the last 1 ms, 10 ms, 20 ms, ...; sorting
>>>> the LRU differently between buffers used for rendering and auxiliary
>>>> buffers used by the kernel; ...); none of those heuristics improved
>>>> anything. I also removed the bo wait in the eviction pipeline, but
>>>> still no improvement. I haven't had time to look further, but the
>>>> bottom line is that some benchmarks are memory-tight and constant
>>>> eviction hurts.
>>>>
>>>> (I used Unigine Heaven and Reaction Quake for benchmarking.)
>>>
>>> I've come up with the following solution, which I think would help
>>> improve the situation a lot.
>>>
>>> We should prepare a list of command streams and one list of
>>> relocations for an entire frame, do buffer validation/placement for
>>> the entire frame at the beginning, and then just render the whole
>>> frame (schedule all the command streams at once). That would minimize
>>> buffer evictions, give us the ideal buffer placements for the whole
>>> frame, and let the GPU run the commands uninterrupted by other
>>> processes (and we wouldn't have to flush caches so much).
>>>
>>> The only downsides are:
>>> - Buffers would be marked as "busy" for the entire frame, because the
>>> fence would only be at the end of the frame. We definitely need a more
>>> fine-grained distribution of fences for apps which map buffers during
>>> rendering. One possible solution is to let userspace emit fences by
>>> itself and associate the fences with the buffers in the relocation
>>> list. The bo-wait mechanism would then use the fence from the
>>> (buffer, fence) pair, while TTM would use the end-of-frame fence
>>> (we can't trust userspace to give us the right fences).
>>> - We should find out how to offload flushing and SwapBuffers to
>>> another thread, because the final CS ioctl will be really big.
>>> Currently, the radeon winsys doesn't offload the CS ioctl if it's in
>>> the SwapBuffers call.
>>
>> - Deferring to a single big flush like that might introduce additional
>> latency before the GPU starts processing a frame and hurt some apps.
>
> Instead of fencing the buffers in userspace, how about something like
> this for the kernel CS interface:
>
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_IB
> RADEON_CHUNK_ID_RELOCS
> RADEON_CHUNK_ID_FLAGS
>
> Fences are only emitted at RADEON_CHUNK_ID_RELOCS borders, but the
> whole CS call is submitted as one single chunk of work, so all BOs get
> reserved and placed at once. That of course doesn't help with the
> higher latency before actually starting a frame, but I don't think
> that would be such a big problem.
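To make the userspace-fence idea from the quoted proposal concrete, here
is a minimal sketch. struct drm_radeon_cs_reloc_fenced and its fence_seq
field are invented for illustration (the real struct drm_radeon_cs_reloc
in radeon_drm.h has only handle, read_domains, write_domain and flags),
and the clamping helper is hypothetical as well:

    #include <stdint.h>

    /* Hypothetical extension of struct drm_radeon_cs_reloc: userspace
     * tags each BO with the sequence number of the last userspace fence
     * that touches it, so bo-wait can wait on that instead of waiting
     * for the end-of-frame fence. */
    struct drm_radeon_cs_reloc_fenced {
            uint32_t handle;        /* GEM handle of the BO */
            uint32_t read_domains;  /* RADEON_GEM_DOMAIN_* */
            uint32_t write_domain;
            uint32_t flags;
            uint32_t fence_seq;     /* last userspace fence touching this BO */
    };

    /* The kernel cannot trust fence_seq, so bo-wait would clamp it:
     * it never waits past the end-of-frame fence TTM itself emitted,
     * and a lying userspace can only corrupt its own data (sequence
     * numbers assumed monotonically increasing). */
    static inline uint32_t bo_wait_seq(uint32_t user_seq, uint32_t frame_seq)
    {
            return user_seq < frame_seq ? user_seq : frame_seq;
    }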
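And a rough sketch of how userspace might build Christian's proposed
layout with the existing chunk structures from radeon_drm.h; the grouping
semantics (one fence per RELOCS border) are exactly what is being
proposed, so the kernel behaviour implied by the comments is assumed, not
current:

    #include <stdint.h>
    #include <drm/radeon_drm.h>

    #define MAX_CHUNKS 32

    struct frame_cs {
            struct drm_radeon_cs_chunk chunks[MAX_CHUNKS];
            uint64_t ptrs[MAX_CHUNKS]; /* drm_radeon_cs.chunks points here */
            uint64_t num_chunks;
    };

    static void add_chunk(struct frame_cs *cs, uint32_t id,
                          uint32_t length_dw, const void *payload)
    {
            struct drm_radeon_cs_chunk *c = &cs->chunks[cs->num_chunks];

            c->chunk_id   = id;                /* RADEON_CHUNK_ID_* */
            c->length_dw  = length_dw;         /* payload size in dwords */
            c->chunk_data = (uintptr_t)payload;
            cs->ptrs[cs->num_chunks++] = (uintptr_t)c;
    }

    /* Per frame: for each group of IBs sharing one relocation list, call
     * add_chunk(cs, RADEON_CHUNK_ID_IB, ...) once per IB, then
     * add_chunk(cs, RADEON_CHUNK_ID_RELOCS, ...) as the fence border;
     * finish with RADEON_CHUNK_ID_FLAGS and submit the whole frame with
     * a single DRM_IOCTL_RADEON_CS. */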
The latency can add input lag, which can negatively impact the gaming
experience, especially in first-person shooters.

In the long run, I think Radeon/TTM should plan buffer moves across
several frames, i.e. take an incremental approach that eventually
converges to the ideal state: a process generating the highest GPU load
should get much better buffer placements than the idling processes after
something like 10-20 CS ioctls, with a guarantee that its buffers won't
be evicted anytime soon.

Also, I think the domains in the relocation list should be ignored
completely and only the initial domain from GEM_CREATE should be taken
into account, because that's the domain the gallium driver wanted in the
first place.
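A minimal sketch of that placement policy, with made-up names
(radeon_bo_info, radeon_pick_placement_domain); the current kernel does
consult the per-reloc domains, so this describes the suggested change
rather than existing code:

    #include <stdint.h>

    #define RADEON_GEM_DOMAIN_GTT  0x2  /* values from radeon_drm.h */
    #define RADEON_GEM_DOMAIN_VRAM 0x4

    struct radeon_bo_info {
            uint32_t initial_domain; /* domain requested at GEM_CREATE time */
    };

    static uint32_t radeon_pick_placement_domain(const struct radeon_bo_info *bo,
                                                 uint32_t reloc_domains)
    {
            (void)reloc_domains;       /* per-reloc domains deliberately ignored */
            return bo->initial_domain; /* what the gallium driver asked for */
    }

Marek
_______________________________________________
mesa-dev mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/mesa-dev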
