On 20.04.21 at 19:44, Daniel Stone wrote:
Hi,
On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand <ja...@jlekstrand.net> wrote:
It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...
No problem, it's helpful. If I weren't on this thread I'd be
attempting to put together a 73-piece chest of drawers whose
instructions are about as clear as this so far, so I'm in the right
head space anyway.
IMO, there are two problems being solved here which are related in
very subtle and tricky ways. They're also, admittedly, driver
problems, not really winsys problems. Unfortunately, they may have
winsys implications.
Yeah ... bingo.
First, is better/real timelines for Vulkan and compute. [...]
We also want something like this for compute workloads. [...]
Totally understand and agree with all of this. Memory fences seem like
a good and useful primitive here.
Completely agree.
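For concreteness, a minimal sketch of the userspace-controlled model being
described, using a Vulkan timeline semaphore wait; it assumes an already
created VkDevice 'device' and a VK_SEMAPHORE_TYPE_TIMELINE semaphore 'sem',
and the matching signal may come from another queue, or another process, at
some arbitrary later point (or never):

#include <vulkan/vulkan.h>

/* Wait-before-signal: block until 'point' is reached on the timeline. */
static VkResult wait_for_point(VkDevice device, VkSemaphore sem, uint64_t point)
{
        VkSemaphoreWaitInfo wait_info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
                .semaphoreCount = 1,
                .pSemaphores = &sem,
                .pValues = &point,
        };

        /* With UINT64_MAX there is no timeout: if the signal never comes,
         * this simply never returns, and the kernel makes no promises. */
        return vkWaitSemaphores(device, &wait_info, UINT64_MAX);
}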
The second biting issue is that, in the current kernel implementation
of dma-fence and dma_resv, we've lumped internal synchronization for
memory management together with execution synchronization for
userspace dependency tracking. And we have no way to tell the
difference between the two internally. Even if user space is passing
around sync_files and trying to do explicit sync, once you get inside
the kernel, they're all dma-fences and it can't tell the difference.
Funny, because 'lumped [the two] together' is exactly the crux of my
issues ...
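To make the "lumping" concrete, a minimal kernel-side sketch; the dma_resv,
sync_file and dma_fence calls are the real interfaces, but the surrounding
driver function and its arguments are made up for illustration:

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/sync_file.h>

static void lump_them_together(struct dma_resv *resv, int in_fence_fd,
                               struct dma_fence *eviction_fence)
{
        /* Userspace "explicit sync": an execution fence that came in via a
         * sync_file fd on some execbuf-style ioctl. */
        struct dma_fence *execution = sync_file_get_fence(in_fence_fd);

        dma_resv_lock(resv, NULL);

        /* Both end up in the same container as bare dma_fences; whoever
         * waits on this resv later (TTM eviction, the GPU scheduler, ...)
         * has no way to tell which one was only userspace dependency
         * tracking and which one is memory management. */
        if (!dma_resv_reserve_shared(resv, 2)) {
                if (execution)
                        dma_resv_add_shared_fence(resv, execution);
                dma_resv_add_shared_fence(resv, eviction_fence);
        }

        dma_resv_unlock(resv);
        dma_fence_put(execution);
}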
If we move
Stop here, because ...
to a more userspace-controlled synchronization model with
wait-before-signal and no timeouts unless requested, regardless of the
implementation, it plays really badly with dma-fence. And, by "badly" I
mean the two are nearly incompatible.
I would go further than that, and say completely, fundamentally,
conceptually, incompatible.
+1
From a user space PoV, it means
it's tricky to provide the finite time dma-fence guarantee. From a
kernel PoV, it's way worse. Currently, the way dma-fence is
constructed, it's impossible to deadlock assuming everyone follows the
rules. The moment we allow user space to deadlock itself and allow
those deadlocks to leak into the kernel, we have a problem. Even if
we throw in some timeouts, we still have a scenario where user space
has one linearizable dependency graph for execution synchronization
and the kernel has a different linearizable dependency graph for
memory management and, when you smash them together, you may have
cycles in your graph.
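To spell out how such a cycle can close, a small made-up example:

  userspace graph:  client A's submit --waits on--> client B's timeline point
  kernel MM graph:  client B's submit --needs VRAM, so the kernel must first
                    evict a buffer still fenced by A's job--> A's dma_fence

A's dma_fence only signals once A's submit has run, but A's submit is parked
on B's timeline point, which B only signals once its own submit runs, which
the kernel has stalled behind A's dma_fence. Smash the two graphs together
and you have a cycle that no single timeout resolves cleanly.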
So how do we sort this all out? Good question. It's a hard problem.
Probably the hardest problem here is the second one: the intermixing
of synchronization types. Solving that one is likely going to require
some user space re-plumbing because all the user space APIs we have
for explicit sync are built on dma-fence.
Gotcha.
Firstly, let's stop, as you say, lumping things together. Timeline
semaphores and compute's GPU-side spinlocks etc. are one thing. I
accept those now have a hard requirement on something like memory
fences, where any responsibility is totally abrogated. So let's run
with that in our strawman: Vulkan compute & graphics & transfer queues
all degenerate to something spinning (hopefully GPU-assisted gentle
spin) on a uint64 somewhere. The kernel has (in the general case) no
visibility or responsibility into these things. Fine - that's one side
of the story.
Exactly, yes.
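A minimal sketch of what that degenerate case amounts to, written with C11
atomics on the CPU purely for illustration; on the GPU the same compare
would sit in a shader or firmware loop:

#include <stdatomic.h>
#include <stdint.h>

static void gentle_spin(const _Atomic uint64_t *fence_value, uint64_t wanted)
{
        /* Nobody promises this terminates; that is exactly the point: the
         * kernel has no visibility into it and no responsibility for it. */
        while (atomic_load_explicit(fence_value, memory_order_acquire) < wanted)
                ; /* ideally a pause/yield hint here, omitted for brevity */
}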
But winsys is something _completely_ different. Yes, you're using the
GPU to do things with buffers A, B, and C to produce buffer Z. Yes,
you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's
composition job might depend on a Chromium composition job which
depends on GTA's render job which depends on GTA's compute job which
might take a year to complete. Mutter's composition job needs to
complete in 'reasonable' (again, FSVO) time, no matter what. The two
are compatible.
How? Don't lump them together. Isolate them aggressively, and
_predictably_ in a way that you can reason about.
What clients do in their own process space is their own
business. Games can deadlock themselves if they get wait-before-signal
wrong. Compute jobs can run for a year. Their problem. Winsys is not
that, because you're crossing every isolation boundary possible.
Process, user, container, VM - every kind of privilege boundary. Thus
far, dma_fence has protected us from the most egregious abuses by
guaranteeing bounded-time completion; it also acts as a sequencing
primitive, but from the perspective of a winsys person that's of
secondary importance, which is probably one of the bigger disconnects
between winsys people and GPU driver people.
Finally somebody who understands me :)
Well, the question then is how we bring the winsys and a client's own
process space together?
Anyway, one of the great things about winsys (there are some! trust
me) is we don't need to be as hopelessly general as for game engines,
nor as hyperoptimised. We place strict demands on our clients, and we
literally kill them every single time they get something wrong in a
way that's visible to us. Our demands on the GPU are so embarrassingly
simple that you can run every modern desktop environment on GPUs which
don't have unified shaders. And on certain platforms which don't share
tiling formats between texture/render-target/scanout ... and it all
still runs fast enough that people don't complain.
I'm ignoring everything below since that is about the display pipeline,
which I'm not really interested in. My concern is how to get a buffer from
the client to the server without allowing the client to get the server into
trouble.
My thinking is still to use timeouts when acquiring texture locks. E.g. when
the compositor needs to access a texture it grabs a lock, and if that lock
isn't available within 20ms, whoever is holding it is killed hard and the
lock is given to the compositor.
It's perfectly fine if a process has a hung queue, but if it tries to
send buffers which should be filled by that queue to the compositor, it
just gets corrupted window content.
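On the compositor side that could look roughly like the sketch below; the
pthread calls are real, while punish_lock_holder() is a made-up placeholder
for the "kill the holder hard and take the lock over" part:

#include <errno.h>
#include <pthread.h>
#include <time.h>

int punish_lock_holder(pthread_mutex_t *lock); /* hypothetical */

int acquire_texture_lock(pthread_mutex_t *lock)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 20 * 1000 * 1000; /* the 20ms budget from above */
        if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec += 1;
                deadline.tv_nsec -= 1000000000L;
        }

        if (pthread_mutex_timedlock(lock, &deadline) == ETIMEDOUT) {
                /* Whoever holds the lock missed the deadline: kill it and
                 * hand the lock to the compositor. */
                return punish_lock_holder(lock);
        }
        return 0;
}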
Regards,
Christian.
We're happy to bear the pain of being the ones setting strict and
unreasonable expectations. To me, this 'present ioctl' falls into the
uncanny valley of the kernel trying to bear too much of the weight to
be tractable, whilst not bearing enough of the weight to be useful for
winsys.
So here's my principles for a counter-strawman:
Remove the 'return fence'. Burn it with fire, do not look back. Modern
presentation pipelines are not necessarily 1:1, they are not
necessarily FIFO (as opposed to mailbox), and they are not necessarily
round-robin either. The current proposal provides no tangible benefits
to modern userspace, and fixing that requires either hobbling
userspace to remove capability and flexibility (ironic given that the
motivation for this is all about userspace flexibility?), or pushing
so much complexity into the kernel that we break it forever (you can't
compile Mutter's per-frame decision tree into eBPF).
Give us a primitive representing work completion, so we can keep
optimistically pipelining operations. We're happy to pass around
explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing,
whatever it is): plumbing through a sync token to synchronise
compositor operations against client operations in both directions is
just a matter of boring typing.
Make that primitive something that is every bit as usable across
subsystems as it is across processes. It should be a lowest common
denominator for middleware that ultimately provokes GPU execbuf, KMS
commit, and media codec ops; currently that would be both wait and
signal for all of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and
VA-API {en,de}code ops. It must be exportable to and importable from
an FD, which can be poll()ed on and read(). GPU-side visibility for
late binding is nice, but not at all essential.
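As a rough sketch, the FD-export-plus-poll() part already exists today for a
sync_file fd, where POLLIN means the underlying fence has signaled:

#include <poll.h>

/* Returns 1 if the fence behind 'fence_fd' signaled within 'timeout_ms',
 * 0 on timeout, -1 on error. */
int wait_sync_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret < 0)
                return -1;
        return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}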
Make that primitive complete in 'reasonable' time, no matter what.
There will always be failures in extremis, no matter what the design:
absent hard-realtime principles from hardware all the way up to
userspace, something will always be able to fail somewhere:
non-terminating GPU work, actual GPU hang/reset, GPU queue DoSed, CPU
scheduler, I/O DoSed. As long as the general case is bounded-time
completion, each of these can be mitigated separately as long as
userspace has enough visibility into the underlying mechanics, and
cares enough to take meaningful action on it.
And something more concrete:
dma_fence.
This already has all of the properties described above. Kernel-wise,
it already devolves to CPU-side signaling when it crosses device
boundaries. We need to support it roughly forever since it's been
plumbed so far and so wide. Any primitive which is acceptable for
winsys-like usage which crosses so many
device/subsystem/process/security boundaries has to meet the same
requirements. So why reinvent something which looks so similar, and
has the same requirements of the kernel babysitting completion,
providing little to no benefit for that difference?
It's not usable for complex usecases, as we've established, but winsys
is not that usecase. We can draw a hard boundary between the two
worlds. For example, a client could submit an infinitely deep CS ->
VS/FS/etc job chain with potentially-infinite completion, with the FS
output being passed to the winsys for composition. Draw the line
post-FS: export a dma_fence against FS completion. But instead of this
being based on monitoring the _fence_ per se, base it on monitoring
the job; if the final job doesn't retire in reasonable time, signal
the fence and signal (like, SIGKILL, or just tear down the context and
permanently -EIO, whatever) the client. Maybe for future hardware that
would be the same thing - the kernel setting a timeout and comparing a
read on a particular address against a particular value - but the
'present fence' proposal seems like it requires exactly this anyway.
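One made-up sketch of how that could look inside a driver; dma_fence_signal()
and the workqueue calls are real kernel APIs, while struct present_job and
wedge_client() are placeholders invented for this example:

#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

void wedge_client(void *client_ctx); /* hypothetical: tear down, -EIO forever */

struct present_job {
        struct dma_fence *present_fence; /* the fence exported to the winsys */
        struct delayed_work watchdog;
        void *client_ctx;
};

static void present_watchdog(struct work_struct *work)
{
        struct present_job *job =
                container_of(to_delayed_work(work), struct present_job, watchdog);

        /* The job did not retire in 'reasonable' time: keep the winsys
         * promise by signalling the fence anyway (optionally after a
         * dma_fence_set_error(fence, -ETIMEDOUT)) ... */
        dma_fence_signal(job->present_fence);
        /* ... and make it the client's problem. */
        wedge_client(job->client_ctx);
}

/* Armed at submission; the retire path cancels it if the job finishes in
 * time (cancellation not shown). */
static void arm_present_watchdog(struct present_job *job)
{
        INIT_DELAYED_WORK(&job->watchdog, present_watchdog);
        queue_delayed_work(system_wq, &job->watchdog,
                           msecs_to_jiffies(500)); /* "reasonable" time */
}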
That to me is the best compromise. We allow clients complete arbitrary
flexibility, but as soon as they vkQueuePresentKHR, they're crossing a
boundary out of happy fun GPU land and into strange hostile winsys
land. We've got a lot of practice at being the bad guys who hate users
and are always trying to ruin their dreams, so we'll happily wear the
impact of continuing to do that. In doing so, we collectively don't
have to invent a third new synchronisation primitive (to add to
dma_fence and drm_syncobj) and a third new synchronisation model
(implicit sync, explicit-but-bounded sync,
explicit-and-maybe-unbounded sync) to support this, and we don't have
to do an NT4 where GDI was shoved into the kernel.
It doesn't help with the goal of ridding dma_fence from the kernel,
but it does very clearly segregate the two worlds. Drawing that hard
boundary would allow drivers to hyperoptimise for clients which want
to be extremely clever and agile and quick because they're sailing so
close to the wind that they cannot bear the overhead of dma_fence,
whilst also providing the guarantees we need when crossing isolation
boundaries. In the latter case, the overhead of bouncing into a
less-optimised primitive is totally acceptable because it's not even
measurable: vkQueuePresentKHR requires client CPU activity -> kernel
IPC -> compositor CPU activity -> wait for repaint cycle -> prepare
scene -> composition, against which dma_fence overhead isn't and will
never be measurable (even if it doesn't cross device/subsystem
boundaries, which it probably does). And the converse for
vkAcquireNextImageKHR.
tl;dr: we don't need to move winsys into the kernel, winsys and
compute don't need to share sync primitives, the client/winsys
boundary does need a sync primitive with strong and onerous
guarantees, and that transition can be several orders of magnitude
less efficient than intra-client sync primitives.
Shoot me down. :)
Cheers,
Daniel
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev