Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König Tue, 20 Apr 2021 05:19:33 -0700

Hi Daniel,

On 20.04.21 at 14:01, Daniel Vetter wrote:

On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:

Hi,


This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
   two major variants: Implicit synchronization, where the kernel attaches
   the dma_fences to a dma-buf, and explicit synchronization, where the
   dma_fence gets passed around as a stand-alone object, either a sync_file
   or a drm_syncobj

- Then there's userspace fence synchronization, where userspace issues any
   fences directly and the kernel doesn't even know what's going on. This
   is the only model that allows you to ditch the kernel overhead, and it's
   also the model that vk uses.

   I concur with Jason that this one is the future, it's the model hw
   wants, compute wants and vk wants. Building an explicit fence world
   which doesn't aim at this is imo wasted effort.
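A rough sketch of what "userspace fence" means here, assuming the fence is nothing more than a monotonically increasing sequence number in memory that the signaller writes and waiters compare against. The class and method names are hypothetical, and a condition variable stands in for whatever optimized CPU wait the kernel might provide:

```python
import threading

class UserFence:
    """Toy userspace fence: a monotonically increasing sequence number.

    Assumption for illustration only: in a real driver the value would live
    in memory shared with the GPU, and the kernel would not track it at all.
    """

    def __init__(self):
        self.value = 0
        self._cond = threading.Condition()

    def signal(self, seqno):
        # The signaller only ever moves the value forward.
        with self._cond:
            self.value = max(self.value, seqno)
            self._cond.notify_all()

    def is_signalled(self, seqno):
        # A wait point is reached once the value is at or past it.
        return self.value >= seqno

    def wait(self, seqno, timeout=None):
        # Blocking CPU wait; returns False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(lambda: self.value >= seqno, timeout)
```

Nothing in this model lets the kernel know who will signal or when, which is why it avoids the kernel overhead and also why the kernel cannot make guarantees about it.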

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board won't happen. I think right
   now only amd's GFX10 or so has enough pagefault support to allow this,


It's even worse. GFX9 has enough support that it can work in theory.

Because of this Felix and his team are working on HMM support based on this generation.

On GFX10 some aspects of it are improved while others are totally broken again.

   and even there I'm not really sure. Nothing else will anytime soon, at
   least not as far as I know. So we need to support slightly more hw in
   upstream than just that.  Any plan that's realistic needs to cope with
   dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
   management approach. We've kinda tried for years to move away from it.
   Sure we can support it as an optimization in specific workloads, and it
   will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
   compositors rely on implicit sync. You're not going to be able to get
   rid of them anytime soon. What's worse, all the various EGL/vk buffer
   sharing things also rely on implicit sync, so you get to fix up tons of
   applications on top. Any plan that's realistic needs to cope with
   implicit and explicit sync at the same time, otherwise it won't work.

- Absolutely infuriating, but you can't use page-faulting together with any
   dma_fence synchronization primitives, whether implicit or explicit. This
   means until the entire ecosystem has moved forward (good luck with that) we
   have to support dma_fence. The only sync model that works together with
   page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must explicitly set for which buffers the kernel
   should engage in implicit synchronization. That's how it works on all
   other drivers that support more explicit userspace like vk or gl drivers
   that are internally all explicit. So essentially you only set the
   implicit fence slot when you really want to, and only userspace knows
   this. Implementing this without breaking the current logic probably
   needs some flags.

- the other side isn't there yet upstream, but Jason has patches.
   Essentially you also need to sample your implicit sync points at the
   right spot, to avoid oversync on later rendering by the producer.
   Jason's patch solves this by adding an ioctl to dma-buf to get the
   current set.

- without any of these things, for pure explicit fencing userspace the
   kernel will simply maintain a list of all current users of a buffer. For
   memory management, which means eviction handling, this roughly works like
   you describe below: we wait for everything before a buffer can be moved.

This should get rid of the oversync issues, and since implicit sync is
baked in everywhere right now, you'll have to deal with implicit sync for
a very long time.
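A minimal sketch of the "explicit implicit sync" idea on the cs side, assuming a hypothetical per-buffer opt-in flag in the submission. None of these names are real uapi; the point is only that the kernel collects dependencies from flagged buffers and ignores the rest:

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    name: str
    # Stand-in for the fences the kernel has attached to the dma-buf.
    implicit_fences: list = field(default_factory=list)

@dataclass
class BoRef:
    bo: Buffer
    # Hypothetical flag: userspace asks for implicit sync on this buffer only.
    implicit_sync: bool = False

def fences_to_wait_for(refs):
    """Collect dependencies only from buffers userspace opted in."""
    deps = []
    for ref in refs:
        if ref.implicit_sync:
            deps.extend(ref.bo.implicit_fences)
    return deps
```

With a shared winsys buffer flagged and a private texture left unflagged, only the winsys buffer's fences become dependencies, so the kernel no longer has to guess.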

Next up is reducing the memory manager overhead of all this, without
changing the ecosystem.

- hw option would be page faults, but until we have full explicit
   userspace sync we can't use those. Which currently means compute only.
   Note that for vulkan or maybe also gl this is quite nasty for userspace,
   since as soon as you need to switch to dma_fence sync or implicit sync
   (winsys buffer, or buffer sharing with any of the current set of
   extensions) you have to flip your internal driver state around all sync
   points over from userspace fencing to dma_fence kernel fencing. Can
   still be all explicit using drm_syncobj ofc.

- next up if your hw has preemption, you could use that, except preemption
   takes a while longer, so from memory pov really should be done with
   dma_fence. Plus it has all the same problems in that it requires
   userspace fences.

- now for making dma_fence O(1) in the fastpath you need the shared
   dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think
   radeonsi not yet. But maybe I missed that. Either way we need to do some
   better kernel work so it can also be fast for shared buffers, if those
   become a problem. On the GL side doing this will use a lot of the tricks
   for residency/working set management you describe below, except the
   kernel can still throw out an entire gpu job. This is essentially what
   you describe with 3.1. Vulkan/compute already work like this.

Now this gets the performance up, but it doesn't give us any road towards
using page faults (outside of compute) and so retiring dma_fence for good.
For that we need a few pieces:

- Full new set of userspace winsys protocols and egl/vk extensions. Pray
   it actually gets adopted, because neither AMD nor Intel have the
   engineers to push these kinds of ecosystems/middleware issues forward on
   their payrolls. Good pick is probably using drm_syncobj as the kernel
   primitive for this. Still uses dma_fence underneath.

- Some clever kernel tricks so that we can substitute dma_fence for
   userspace fences within a drm_syncobj. drm_syncobj already has the
   notion of waiting for a dma_fence to materialize. We can abuse that to
   create an upgrade path from dma_fence based sync to userspace fence
   syncing. Ofc none of this will be on the table if userspace hasn't
   adopted explicit sync.

With these two things I think we can have a reasonable upgrade path. None
of this will be break-the-world type things though.


How about this:

1. We extend drm_syncobj to be able to contain both classic dma_fence as well as being used for user fence synchronization.

We already discussed that briefly and I think we should have a rough plan for this in our heads.


2. We allow attaching a drm_syncobj on dma_resv for implicit sync.

This requires that both the consumer as well as the producer side will support user fence synchronization.

We would still have quite a bunch of limitations, especially we would need to adjust all the kernel consumers of classic dma_resv objects. But I think it should be doable.


Regards,
Christian.


Bunch of comments below.

*2. Explicit synchronization for window systems and modesetting*

The producer is an application and the consumer is a compositor or a
modesetting driver.

*2.1. The Present request*

As part of the Present request, the producer will pass 2 fences (sync
objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the
consumer has finished using the presented buffer.

Build this with syncobj timelines and it makes a lot more sense I think.
We'll need that for having a proper upgrade path, both on the hw/driver
side (being able to support stuff like preempt or gpu page faults) and the
ecosystem side (so that we don't have to rev protocols twice, once going
to explicit dma_fence sync and once more for userspace sync).
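To make the Present request with timelines a bit more concrete, here is a toy model, assuming each side owns one timeline and the request names a point on each. All types and function names are made up for illustration and are not a protocol or uapi definition:

```python
from dataclasses import dataclass

class Timeline:
    """Toy timeline sync object: a point is signalled once the payload reaches it."""

    def __init__(self):
        self.payload = 0

    def signal(self, point):
        self.payload = max(self.payload, point)

    def is_signalled(self, point):
        return self.payload >= point

@dataclass
class PresentRequest:
    buffer: str
    submit: Timeline      # signalled by the producer when drawing is done
    submit_point: int
    ret: Timeline         # signalled by the consumer when it is done reading
    return_point: int

def consumer_may_read(req):
    return req.submit.is_signalled(req.submit_point)

def producer_may_reuse(req):
    return req.ret.is_signalled(req.return_point)
```

The same request shape would work whether the points are backed by dma_fence or by userspace fences, which is the upgrade-path argument made above.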

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This
information is part of the Present request and supplied by userspace.
- If the producer crashes, the kernel signals the submit fence, so that the
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the
producer can reclaim the buffer.

So for kernel based sync imo simplest is to just reuse dma_fence, same
rules apply.

For userspace fencing the kernel simply doesn't care how stupid userspace
is. Security checks at boundaries (e.g. client vs compositor) are also
userspace's problem and can be handled by e.g. timeouts + conditional
rendering on the compositor side. The timeout might be in the compat glue,
e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I
think in vulkan this is de facto already up to applications to deal with
entirely if they deal with untrusted fences.

- A GPU hang signals all fences. Other deadlocks will be handled like GPU
hangs.

Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
one, with userspace fencing the kernel isn't aware of any deadlocks, you
fundamentally can't tell "has deadlocked" from "is still doing useful
computations" because that amounts to solving the halting problem.

Any programming model we come up with where both kernel and userspace are
involved needs to come up with rules where at least non-evil userspace
never deadlocks. And if you just allow both then it's pretty easy to come
up with scenarios where both userspace and kernel alone are deadlock free,
but interactions result in hangs. That's why we've recently documented all
the corner cases around indefinite dma_fences, and also why you can't
currently use gpu page faults with anything that uses dma_fence for sync.

That's why I think with userspace fencing the kernel simply should not be
involved at all, aside from providing optimized/blocking cpu wait
functionality.

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be
supported. A merged fence is signalled only when its fences are signalled.
The consumer will have the option to redefine the unsignalled return fence
to a merged fence.
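The merged fence rule can be sketched in a few lines, assuming a fence exposes nothing but a signalled state. The class names are illustrative only:

```python
class Fence:
    """Toy fence with a single signalled state."""

    def __init__(self):
        self._signalled = False

    def signal(self):
        self._signalled = True

    def is_signalled(self):
        return self._signalled

class MergedFence:
    """Signalled only when every contained fence is signalled."""

    def __init__(self, fences):
        self.fences = list(fences)

    def is_signalled(self):
        return all(f.is_signalled() for f in self.fences)
```

A consumer that reads the buffer from two queues could, for example, redefine the return fence as a merge of one fence per queue.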

*2.2. Modesetting*

Since a modesetting driver can also be the consumer, the present ioctl will
contain a submit fence and a return fence too. One small problem with this
is that userspace can hang the modesetting driver, but in theory, any later
present ioctl can override the previous one, so the unsignalled
presentation is never used.


*3. New memory management*

The per-BO fences will be removed and the kernel will not know which
buffers are busy. This will reduce CPU overhead and latency. The kernel
will not need per-BO fences with explicit synchronization, so we just need
to remove their last user: buffer evictions. It also resolves the current
OOM deadlock.

What's "the current OOM deadlock"?

*3.1. Evictions*

If the kernel wants to move a buffer, it will have to wait for everything
to go idle, halt all userspace command submissions, move the buffer, and
resume everything. This is not expected to happen when memory is not
exhausted. Other more efficient ways of synchronization are also possible
(e.g. sync only one process), but are not discussed here.

*3.2. Per-process VRAM usage quota*

Each process can optionally and periodically query its VRAM usage quota and
change domains of its buffers to obey that quota. For example, a process
allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
GB. The process can change the domains of the least important buffers to
GTT to get the best outcome for itself. If the process doesn't do it, the
kernel will choose which buffers to evict at random. (thanks to Christian
Koenig for this idea)
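As an illustration of how a process might obey a reduced quota, here is a sketch that demotes the least important buffers first until usage fits. The priority values, the tuple layout and the function name are assumptions made for the example, not part of the proposal:

```python
def pick_buffers_to_demote(buffers, quota):
    """Return names of buffers to move from VRAM to GTT.

    buffers: list of (name, size, priority) tuples, sizes in the same unit
    as quota; lower priority means less important.
    """
    usage = sum(size for _, size, _ in buffers)
    demoted = []
    # Walk from least to most important and stop once usage fits the quota.
    for name, size, _ in sorted(buffers, key=lambda b: b[2]):
        if usage <= quota:
            break
        demoted.append(name)
        usage -= size
    return demoted
```

With 2048 MB allocated and a quota of 1024 MB, the process gives up its scratch and texture buffers before its framebuffer, instead of leaving the choice to the kernel.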

*3.3. Buffer destruction without per-BO fences*

When the buffer destroy ioctl is called, an optional fence list can be
passed to the kernel to indicate when it's safe to deallocate the buffer.
If the fence list is empty, the buffer will be deallocated immediately.
Shared buffers will be handled by merging fence lists from all processes
that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a
timeout period and then will proceed to deallocate the buffer anyway.
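The destruction rule, including the timeout mitigation, can be written as a single predicate. This is a sketch under the assumption that the kernel only looks at the signalled state of each fence and at the time elapsed since the destroy ioctl; the function name and the timeout parameter are hypothetical:

```python
def can_deallocate(fence_states, elapsed, timeout):
    """Decide whether a destroyed buffer may be freed now.

    fence_states: list of booleans, True for each fence that has signalled.
    elapsed, timeout: time since the destroy call and the kernel's limit,
    in the same unit.
    """
    # An empty fence list means immediate deallocation (all([]) is True).
    # Fences that never signal are overridden once the timeout expires.
    return all(fence_states) or elapsed >= timeout
```

For a shared buffer, fence_states would be the merged list from every process that destroyed it.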

*3.4. Other notes on MM*

Overcommitment of GPU-accessible memory will cause an allocation failure or
invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
supported.

Kernel drivers could move to this new memory management today. Only buffer
residency and evictions would stop using per-BO fences.



*4. Deprecating implicit synchronization*

It can be phased out by introducing a new generation of hardware where the
driver doesn't add support for it (like a driver fork would do), assuming
userspace has all the changes for explicit synchronization. This could
potentially create an isolated part of the kernel DRM where all drivers
only support explicit synchronization.

10-20 years I'd say before that's even an option.
-Daniel

Marek