On 20.04.21 at 19:44, Daniel Stone wrote:
Hi,
On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand <ja...@jlekstrand.net> wrote:
It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...
No problem, it's helpful. If I weren't on this thread I'd be
attempting to put together a 73-piece chest of drawers whose
instructions are about as clear as this so far, so I'm in the right
head space anyway.
IMO, there are two problems being solved here which are related in
very subtle and tricky ways. They're also, admittedly, driver
problems, not really winsys problems. Unfortunately, they may have
winsys implications.
Yeah ... bingo.
First, is better/real timelines for Vulkan and compute. [...]
We also want something like this for compute workloads. [...]
Totally understand and agree with all of this. Memory fences seem like
a good and useful primitive here.
Completely agree.
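For concreteness, a minimal sketch of the userspace-controlled model being
described, using a Vulkan timeline semaphore wait; it assumes an already
created VkDevice 'device' and a VK_SEMAPHORE_TYPE_TIMELINE semaphore 'sem',
and the matching signal may come from another queue, or another process, at
some arbitrary later point (or never):

#include <vulkan/vulkan.h>

/* Wait-before-signal: block until 'point' is reached on the timeline. */
static VkResult wait_for_point(VkDevice device, VkSemaphore sem, uint64_t point)
{
        VkSemaphoreWaitInfo wait_info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
                .semaphoreCount = 1,
                .pSemaphores = &sem,
                .pValues = &point,
        };

        /* With UINT64_MAX there is no timeout: if the signal never comes,
         * this simply never returns, and the kernel makes no promises. */
        return vkWaitSemaphores(device, &wait_info, UINT64_MAX);
}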
The second biting issue is that, in the current kernel implementation
of dma-fence and dma_resv, we've lumped internal synchronization for
memory management together with execution synchronization for
userspace dependency tracking. And we have no way to tell the
difference between the two internally. Even if user space is passing
around sync_files and trying to do explicit sync, once you get inside
the kernel, they're all dma-fences and it can't tell the difference.
Funny, because 'lumped [the two] together' is exactly the crux of my
issues ...
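To make the "lumping" concrete, a minimal kernel-side sketch; the dma_resv,
sync_file and dma_fence calls are the real interfaces, but the surrounding
driver function and its arguments are made up for illustration:

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/sync_file.h>

static void lump_them_together(struct dma_resv *resv, int in_fence_fd,
                               struct dma_fence *eviction_fence)
{
        /* Userspace "explicit sync": an execution fence that came in via a
         * sync_file fd on some execbuf-style ioctl. */
        struct dma_fence *execution = sync_file_get_fence(in_fence_fd);

        dma_resv_lock(resv, NULL);

        /* Both end up in the same container as bare dma_fences; whoever
         * waits on this resv later (TTM eviction, the GPU scheduler, ...)
         * has no way to tell which one was only userspace dependency
         * tracking and which one is memory management. */
        if (!dma_resv_reserve_shared(resv, 2)) {
                if (execution)
                        dma_resv_add_shared_fence(resv, execution);
                dma_resv_add_shared_fence(resv, eviction_fence);
        }

        dma_resv_unlock(resv);
        dma_fence_put(execution);
}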
If we move
Stop here, because ...
to a more userspace-controlled synchronization model with
wait-before-signal and no timeouts unless requested, regardless of the
implementation, it plays really badly with dma-fence. And, by "badly" I
mean the two are nearly incompatible.
I would go further than that, and say completely, fundamentally,
conceptually, incompatible.
+1
From a user space PoV, it means
it's tricky to provide the finite time dma-fence guarantee. From a
kernel PoV, it's way worse. Currently, the way dma-fence is
constructed, it's impossible to deadlock assuming everyone follows the
rules. The moment we allow user space to deadlock itself and allow
those deadlocks to leak into the kernel, we have a problem. Even if
we throw in some timeouts, we still have a scenario where user space
has one linearizable dependency graph for execution synchronization
and the kernel has a different linearizable dependency graph for
memory management and, when you smash them together, you may have
cycles in your graph.
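To spell out how such a cycle can close, a small made-up example:

  userspace graph:  client A's submit --waits on--> client B's timeline point
  kernel MM graph:  client B's submit --needs VRAM, so the kernel must first
                    evict a buffer still fenced by A's job--> A's dma_fence

A's dma_fence only signals once A's submit has run, but A's submit is parked
on B's timeline point, which B only signals once its own submit runs, which
the kernel has stalled behind A's dma_fence. Smash the two graphs together
and you have a cycle that no single timeout resolves cleanly.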
So how do we sort this all out? Good question. It's a hard problem.
Probably the hardest problem here is the second one: the intermixing
of synchronization types. Solving that one is likely going to require
some user space re-plumbing because all the user space APIs we have
for explicit sync are built on dma-fence.
Gotcha.
Firstly, let's stop, as you say, lumping things together. Timeline
semaphores and compute's GPU-side spinlocks etc. are one thing. I
accept those now have a hard requirement on something like memory
fences, where any responsibility is totally abrogated. So let's run
with that in our strawman: Vulkan compute & graphics & transfer queues
all degenerate to something spinning (hopefully GPU-assisted gentle
spin) on a uint64 somewhere. The kernel has (in the general case) no
visibility or responsibility into these things. Fine - that's one side
of the story.
Exactly, yes.
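A minimal sketch of what that degenerate case amounts to, written with C11
atomics on the CPU purely for illustration; on the GPU the same compare
would sit in a shader or firmware loop:

#include <stdatomic.h>
#include <stdint.h>

static void gentle_spin(const _Atomic uint64_t *fence_value, uint64_t wanted)
{
        /* Nobody promises this terminates; that is exactly the point: the
         * kernel has no visibility into it and no responsibility for it. */
        while (atomic_load_explicit(fence_value, memory_order_acquire) < wanted)
                ; /* ideally a pause/yield hint here, omitted for brevity */
}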
But winsys is something _completely_ different. Yes, you're using the
GPU to do things with buffers A, B, and C to produce buffer Z. Yes,
you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's
composition job might depend on a Chromium composition job which
depends on GTA's render job which depends on GTA's compute job which
might take a year to complete. Mutter's composition job needs to
complete in 'reasonable' (again, FSVO) time, no matter what. The two
are compatible.
How? Don't lump them together. Isolate them aggressively, and
_predictably_ in a way that you can reason about.
What clients do in their own process space is their own
business. Games can deadlock themselves if they get wait-before-signal
wrong. Compute jobs can run for a year. Their problem. Winsys is not
that, because you're crossing every isolation boundary possible.
Process, user, container, VM - every kind of privilege boundary. Thus
far, dma_fence has protected us from the most egregious abuses by
guaranteeing bounded-time completion; it also acts as a sequencing
primitive, but from the perspective of a winsys person that's of
secondary importance, which is probably one of the bigger disconnects
between winsys people and GPU driver people.
Finally somebody who understands me :)
Well, the question then is how we bring the winsys and a client's own
process space together?
Anyway, one of the great things about winsys (there are some! trust
me) is we don't need to be as hopelessly general as for game engines,
nor as hyperoptimised. We place strict demands on our clients, and we
literally kill them every single time they get something wrong in a
way that's visible to us. Our demands on the GPU are so embarrassingly
simple that you can run every modern desktop environment on GPUs which
don't have unified shaders. And on certain platforms which don't share
tiling formats between texture/render-target/scanout ... and it all
still runs fast enough that people don't complain.
I'm ignoring everything below since that is about the display pipeline,
which I'm not really interested in. My concern is how to get a buffer from
the client to the server without allowing the client to get the server into
trouble.
My thinking is still to use timeouts when acquiring texture locks. E.g. when
the compositor needs to access a texture it grabs a lock, and if that lock
isn't available within 20ms, whoever is holding it is killed hard and the
lock is given to the compositor.
It's perfectly fine if a process has a hung queue, but if it tries to
send buffers which should be filled by that queue to the compositor, it
just gets corrupted window content.
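On the compositor side that could look roughly like the sketch below; the
pthread calls are real, while punish_lock_holder() is a made-up placeholder
for the "kill the holder hard and take the lock over" part:

#include <errno.h>
#include <pthread.h>
#include <time.h>

int punish_lock_holder(pthread_mutex_t *lock); /* hypothetical */

int acquire_texture_lock(pthread_mutex_t *lock)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 20 * 1000 * 1000; /* the 20ms budget from above */
        if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec += 1;
                deadline.tv_nsec -= 1000000000L;
        }

        if (pthread_mutex_timedlock(lock, &deadline) == ETIMEDOUT) {
                /* Whoever holds the lock missed the deadline: kill it and
                 * hand the lock to the compositor. */
                return punish_lock_holder(lock);
        }
        return 0;
}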
Regards,
Christian.
We're happy to bear the pain of being the ones setting strict and
unreasonable expectations. To me, this 'present ioctl' falls into the
uncanny valley of the kernel trying to bear too much of the weight to
be tractable, whilst not bearing enough of the weight to be useful for
winsys.
So here's my principles for a counter-strawman:
Remove the 'return fence'. Burn it with fire, do not look back. Modern
presentation pipelines are not necessarily 1:1, they are not
necessarily FIFO (as opposed to mailbox), and they are not necessarily
round-robin either. The current proposal provides no tangible benefits
to modern userspace, and fixing that requires either hobbling
userspace to remove capability and flexibility (ironic given that the
motivation for this is all about userspace flexibility?), or pushing
so much complexity into the kernel that we break it forever (you can't
compile Mutter's per-frame decision tree into eBPF).
Give us a primitive representing work completion, so we can keep
optimistically pipelining operations. We're happy to pass around
explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing,
whatever it is): plumbing through a sync token to synchronise
compositor operations against client operations in both directions is
just a matter of boring typing.
Make that primitive something that is every bit as usable across
subsystems as it is across processes. It should be a lowest common
denominator for middleware that ultimately provokes GPU execbuf, KMS
commit, and media codec ops; currently that would be both wait and
signal for all of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and
VA-API {en,de}code ops. It must be exportable to and importable from
an FD, which can be poll()ed on and read(). GPU-side visibility for
late binding is nice, but not at all essential.
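As a rough sketch, the FD-export-plus-poll() part already exists today for a
sync_file fd, where POLLIN means the underlying fence has signaled:

#include <poll.h>

/* Returns 1 if the fence behind 'fence_fd' signaled within 'timeout_ms',
 * 0 on timeout, -1 on error. */
int wait_sync_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret < 0)
                return -1;
        return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}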
Make that primitive complete in 'reasonable' time, no matter what.
There will always be failures in extremis, no matter what the design:
absent hard-realtime principles from hardware all the way up to
userspace, something will always be able to fail somewhere:
non-terminating GPU work, actual GPU hang/reset, GPU queue DoSed, CPU
scheduler, I/O DoSed. As long as the general case is bounded-time
completion, each of these can be mitigated separately as long as
userspace has enough visibility into the underlying mechanics, and
cares enough to take meaningful action on it.
And something more concrete:
dma_fence.
This already has all of the properties described above. Kernel-wise,
it already devolves to CPU-side signaling when it crosses device
boundaries. We need to support it roughly forever since it's been
plumbed so far and so wide. Any primitive which is acceptable for
winsys-like usage which crosses so many
device/subsystem/process/security boundaries has to meet the same
requirements. So why reinvent something which looks so similar, and
has the same requirements of the kernel babysitting completion,
providing little to no benefit for that difference?
It's not usable for complex usecases, as we've established, but winsys
is not that usecase. We can draw a hard boundary between the two
worlds. For example, a client could submit an infinitely deep CS ->
VS/FS/etc job chain with potentially-infinite completion, with the FS
output being passed to the winsys for composition. Draw the line
post-FS: export a dma_fence against FS completion. But instead of this
being based on monitoring the _fence_ per se, base it on monitoring
the job; if the final job doesn't retire in reasonable time, signal
the fence and signal (like, SIGKILL, or just tear down the context and
permanently -EIO, whatever) the client. Maybe for future hardware that
would be the same thing - the kernel setting a timeout and comparing a
read on a particular address against a particular value - but the
'present fence' proposal seems like it requires exactly this anyway.
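One made-up sketch of how that could look inside a driver; dma_fence_signal()
and the workqueue calls are real kernel APIs, while struct present_job and
wedge_client() are placeholders invented for this example:

#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

void wedge_client(void *client_ctx); /* hypothetical: tear down, -EIO forever */

struct present_job {
        struct dma_fence *present_fence; /* the fence exported to the winsys */
        struct delayed_work watchdog;
        void *client_ctx;
};

static void present_watchdog(struct work_struct *work)
{
        struct present_job *job =
                container_of(to_delayed_work(work), struct present_job, watchdog);

        /* The job did not retire in 'reasonable' time: keep the winsys
         * promise by signalling the fence anyway (optionally after a
         * dma_fence_set_error(fence, -ETIMEDOUT)) ... */
        dma_fence_signal(job->present_fence);
        /* ... and make it the client's problem. */
        wedge_client(job->client_ctx);
}

/* Armed at submission; the retire path cancels it if the job finishes in
 * time (cancellation not shown). */
static void arm_present_watchdog(struct present_job *job)
{
        INIT_DELAYED_WORK(&job->watchdog, present_watchdog);
        queue_delayed_work(system_wq, &job->watchdog,
                           msecs_to_jiffies(500)); /* "reasonable" time */
}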
That to me is the best compromise. We allow clients complete arbitrary
flexibility, but as soon as they vkQueuePresentKHR, they're crossing a
boundary out of happy fun GPU land and into strange hostile winsys
land. We've got a lot of practice at being the bad guys who hate users
and are always trying to ruin their dreams, so we'll happily wear the
impact of continuing to do that. In doing so, we collectively don't
have to invent a third new synchronisation primitive (to add to
dma_fence and drm_syncobj) and a third new synchronisation model
(implicit sync, explicit-but-bounded sync,
explicit-and-maybe-unbounded sync) to support this, and we don't have
to do an NT4 where GDI was shoved into the kernel.
It doesn't help with the goal of ridding dma_fence from the kernel,
but it does very clearly segregate the two worlds. Drawing that hard
boundary would allow drivers to hyperoptimise for clients which want
to be extremely clever and agile and quick because they're sailing so
close to the wind that they cannot bear the overhead of dma_fence,
whilst also providing the guarantees we need when crossing isolation
boundaries. In the latter case, the overhead of bouncing into a
less-optimised primitive is totally acceptable because it's not even
measurable: vkQueuePresentKHR requires client CPU activity -> kernel
IPC -> compositor CPU activity -> wait for repaint cycle -> prepare
scene -> composition, against which dma_fence overhead isn't and will
never be measurable (even if it doesn't cross device/subsystem
boundaries, which it probably does). And the converse for
vkAcquireNextImageKHR.
tl;dr: we don't need to move winsys into the kernel, winsys and
compute don't need to share sync primitives, the client/winsys
boundary does need a sync primitive with strong and onerous
guarantees, and that transition can be several orders of magnitude
less efficient than intra-client sync primitives.
Shoot me down. :)
Cheers,
Daniel
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev