On 12.11.2012 11:08, Michel Dänzer wrote:
On Sam, 2012-11-10 at 16:52 +0100, Marek Olšák wrote:
On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse <[email protected]> wrote:
On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher <[email protected]> wrote:
On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák <[email protected]> wrote:
The problem was that we set VRAM|GTT for relocations of STATIC resources.
Setting just VRAM increases the framerate 4 times on my machine.
I rewrote the switch statement and adjusted the domains for window
framebuffers too.
Reviewed-by: Alex Deucher <[email protected]>
Stable branches?
Yes, good idea.
Marek
Btw, as a follow-up on this, I did some experiments with TTM and eviction.
Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
(40-60%), but it diminishes maximum fps (100%). Overall, blocking eviction
just makes the framerate more consistent.
I then tried several heuristics in the eviction process (not evicting a
buffer if it was used in the last 1ms, 10ms, 20ms, ...; sorting the LRU
differently between buffers used for rendering and auxiliary buffers used
by the kernel, ...). None of those heuristics improved anything. I also
removed the bo wait in the eviction pipeline, but still no improvement. I
haven't had time to look further, but anyway the bottom line is that some
benchmarks are memory-tight and constant eviction hurts.
(I used Unigine Heaven and Reaction Quake as benchmarks.)
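The age-threshold heuristic described above (don't evict a buffer that was used within the last 1/10/20 ms) could be sketched roughly like this. This is only an illustration of the idea; the `struct bo` layout and the `should_evict` name are made up for the example and are not actual TTM code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical buffer-object record; the real TTM state differs. */
struct bo {
    uint64_t last_use_ns; /* timestamp of the buffer's last GPU use */
};

/* Only allow eviction if the buffer has been idle longer than the
 * threshold. This mirrors the "don't evict if used in the last
 * 1ms/10ms/20ms" experiment from the mail; per the results above,
 * no threshold actually helped the benchmarks tested. */
static bool should_evict(const struct bo *bo, uint64_t now_ns,
                         uint64_t threshold_ns)
{
    return now_ns - bo->last_use_ns > threshold_ns;
}
```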
I've come up with the following solution, which I think would help
improve the situation a lot.
We should prepare a list of command streams and one list of
relocations for an entire frame, do buffer validation/placements for
the entire frame at the beginning and then just render the whole frame
(schedule all the command streams at once). That would minimize the
buffer evictions and give us the ideal buffer placements for the whole
frame and then the GPU would run the commands uninterrupted by other
processes (and we don't have to flush caches so much).
The only downsides are:
- Buffers would be marked as "busy" for the entire frame, because the
fence would only be at the end of the frame. We definitely need more
fine-grained distribution of fences for apps which map buffers during
rendering. One possible solution is to let userspace emit fences by
itself and associate the fences with the buffers in the relocation
list. The bo-wait mechanism would then use the fence from the (buffer,
fence) pair, while TTM would use the end-of-frame fence (we can't
trust userspace to give us the right fences).
- We should find out how to offload flushing and SwapBuffers to
another thread, because the final CS ioctl will be really big.
Currently, the radeon winsys doesn't offload the CS ioctl if it's in
the SwapBuffers call.
- Deferring to a single big flush like that might introduce additional
latency before the GPU starts processing a frame and hurt some apps.
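The (buffer, fence) pairing from the first downside above could work roughly as follows. Everything here is a sketch under assumptions: the struct and function names are invented for illustration and are not an existing radeon interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relocation entry carrying a userspace-emitted fence. */
struct reloc_entry {
    uint32_t bo_handle;
    uint32_t user_fence; /* fence userspace claims covers this BO */
};

/* bo-wait for mapping: the buffer counts as idle once the GPU has
 * passed its per-buffer fence, even though the frame-wide fence has
 * not signaled yet. */
static bool bo_idle_for_map(const struct reloc_entry *r,
                            uint32_t last_completed_fence)
{
    return last_completed_fence >= r->user_fence;
}

/* TTM cannot trust userspace fences, so for eviction the buffer only
 * becomes idle once the kernel's end-of-frame fence has passed. */
static bool bo_idle_for_eviction(uint32_t end_of_frame_fence,
                                 uint32_t last_completed_fence)
{
    return last_completed_fence >= end_of_frame_fence;
}
```

The point of the split is that an app mapping a buffer mid-frame only waits on that buffer's own fence, while memory management stays conservative.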
Instead of fencing the buffers in userspace how about something like
this for the kernel CS interface:
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_IB
RADEON_CHUNK_ID_RELOCS
RADEON_CHUNK_ID_FLAGS
Fences are only emitted at RADEON_CHUNK_ID_RELOCS borders, but the whole
CS call is submitted as one single chunk of work and so all BOs get
reserved and placed at once. That of course doesn't help with the higher
latency before starting a frame, but I don't think this would
actually be such a big problem.
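The fence-per-RELOCS-border rule above can be illustrated with a small sketch. The real radeon UAPI chunk structs in radeon_drm.h carry more fields (length_dw, chunk_data, etc.); the simplified enum and struct here are stand-ins for the example only.

```c
/* Simplified stand-ins for the radeon CS chunk types. */
enum chunk_id { CHUNK_IB, CHUNK_RELOCS, CHUNK_FLAGS };

struct chunk {
    enum chunk_id id;
};

/* Walk the chunk list submitted in one CS ioctl and count how many
 * fences the kernel would emit under the proposal: one per RELOCS
 * border, while all BOs are still reserved and placed once for the
 * whole submission. */
static int count_fences(const struct chunk *chunks, int n)
{
    int fences = 0;
    for (int i = 0; i < n; i++)
        if (chunks[i].id == CHUNK_RELOCS)
            fences++;
    return fences;
}
```

For the exact chunk layout shown above (four IBs, RELOCS, two IBs, RELOCS, two IBs, RELOCS, FLAGS), this yields three fences for one submission.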
Possible improvement:
- The userspace should emit commands into a GPU buffer rather than
user memory, so that we don't have to do copy_from_user in the kernel.
I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
as well as putting the buffer in the relocation list.
Unmapping etc. shouldn't be necessary in the long run with GPUVM.
We already have patches in internal review that allow userspace to
submit IBs without any CS checking, which avoids the whole
copy_from_user and checking overhead. So don't worry too much about
this problem.
Christian.
_______________________________________________
mesa-dev mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/mesa-dev