Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Daniel Stone
Hi,

On Tue, 1 Jun 2021 at 14:18, Michel Dänzer wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
> > Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >> There isn't a choice for Wayland compositors in general, since there can 
> >> be arbitrary other state which needs to be applied atomically together 
> >> with the new buffer. (Though in theory, a compositor might get fancy and 
> >> special-case surface commits which can be handled by waiting on the GPU)

Yeah, this is pretty crucial.

> >> Latency is largely a matter of scheduling in the compositor. The latency 
> >> incurred by the compositor shouldn't have to be more than single-digit 
> >> milliseconds. (I've seen total latency from when the client starts 
> >> processing a (static) frame to when it starts being scanned out as low as 
> >> ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
> >> lower than typical with Xorg)
> >
> > Well let me describe it like this:
> >
> > We have a use case for 144 Hz guaranteed refresh rate. That essentially 
> > means that the client application needs to be able to spit out one 
> > frame/window content every ~6.9ms. That's tough, but doable.
> >
> > When you now add 6ms latency in the compositor that means the client 
> > application has only 0.9ms left for its frame, which is basically impossible 
> > to do.
>
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
> client to scanout, but the client can start as early as it wants/needs to. 
> It's a trade-off between latency and the risk of missing a scanout cycle.

Not quite.

What weston-presentation-shm is reporting is a 6ms delta between when
it started its rendering loop and when the frame was presented to the
screen. How w-p-s was run matters a lot, because you can insert an
arbitrary delay in there to simulate client rendering. It also matters
a lot that the client is SHM, because that will result in Mutter doing
glTexImage2D on whatever size the window is, then doing a full GL
compositing pass. So even if it's run with zero delay, 6ms isn't 'the
amount of time it takes Mutter to get a frame to screen'; it's
measuring the overhead of a texture upload and full-screen composition
as well.

I'm assuming the 'guaranteed 144Hz' target is a fullscreen GL client,
for which you definitely avoid TexImage2D, and could hopefully
(assuming the client picks a modifier which can be displayed) also
avoid the composition pass in favour of direct scanout from the client
buffer; that would give you a much lower number.
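For reference, whether that direct-scanout path is available mostly
boils down to whether the kernel accepts the client buffer as a
framebuffer. A minimal sketch of the check, with all parameters assumed
to come from the client's buffer attach (illustrative names, error
handling elided):

  /* Try to import a client dma-buf as a KMS framebuffer for direct
   * scanout; if the kernel rejects the format/modifier combination,
   * fall back to a GL composition pass. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <xf86drm.h>
  #include <xf86drmMode.h>

  static bool try_direct_scanout(int drm_fd, uint32_t width, uint32_t height,
                                 uint32_t fourcc, uint32_t gem_handle,
                                 uint32_t pitch, uint64_t modifier,
                                 uint32_t *fb_id)
  {
      uint32_t handles[4] = { gem_handle };
      uint32_t pitches[4] = { pitch };
      uint32_t offsets[4] = { 0 };
      uint64_t modifiers[4] = { modifier };

      /* Fails if the display engine can't scan out this layout; the
       * compositor then composites as usual. */
      return drmModeAddFB2WithModifiers(drm_fd, width, height, fourcc,
                                        handles, pitches, offsets,
                                        modifiers, fb_id,
                                        DRM_MODE_FB_MODIFIERS) == 0;
  }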

Each compositor has its own heuristics around timing. They all make
their own tradeoff between low latency and fewest dropped frames.
Obviously, the higher your latency, the lower the chance of missing a
deadline. There's a lot of detail in the MR that Michel linked.
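To make that trade-off concrete, here's a sketch of the kind of
heuristic involved, loosely in the spirit of Mutter's dynamic max
render time (all names made up for illustration):

  /* Track recent composition durations and start compositing that long
   * (plus a safety margin) before the predicted vblank. A bigger margin
   * means fewer missed scanout cycles, but more latency. */
  #include <stdint.h>

  #define N_SAMPLES 16

  struct frame_clock {
      int64_t render_ns[N_SAMPLES]; /* recent composition durations */
      int n_samples;
  };

  static int64_t max_render_time_ns(const struct frame_clock *c)
  {
      int64_t max = 0;

      for (int i = 0; i < c->n_samples; i++)
          if (c->render_ns[i] > max)
              max = c->render_ns[i];

      return max + max / 4; /* 25% safety margin */
  }

  /* Composition for the next frame is then dispatched at
   * next_vblank_ns - max_render_time_ns(clock). */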

Cheers,
Daniel


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 3:18 p.m., Michel Dänzer wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
>> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>>>> 3) Compositors (and other privileged processes, and display flipping) 
>>>>>> can't trust imported/exported fences. They need a timeout recovery 
>>>>>> mechanism from the beginning, and the following are some possible 
>>>>>> solutions to timeouts:
>>>>>>
>>>>>> a) use a CPU wait with a small absolute timeout, and display the 
>>>>>> previous content on timeout
>>>>>> b) use a GPU wait with a small absolute timeout, and conditional 
>>>>>> rendering will choose between the latest content (if signalled) and 
>>>>>> previous content (if timed out)
>>>>>>
>>>>>> The result would be that the desktop can run close to 60 fps even if an 
>>>>>> app runs at 1 fps.
>>>>> FWIW, this is working with
>>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
>>>>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
>>>>> provide the same dma-buf poll semantics as other drivers and high 
>>>>> priority GFX contexts via EGL_IMG_context_priority which can preempt 
>>>>> lower priority ones).
>>>> Yeah, that is really nice to have.
>>>>
>>>> One question is if you wait on the CPU or the GPU for the new surface to 
>>>> become available?
>>> It's based on polling dma-buf fds, i.e. CPU.
>>>
>>>> The former is a bit bad for latency and power management.
>>> There isn't a choice for Wayland compositors in general, since there can be 
>>> arbitrary other state which needs to be applied atomically together with 
>>> the new buffer. (Though in theory, a compositor might get fancy and 
>>> special-case surface commits which can be handled by waiting on the GPU)
>>>
>>> Latency is largely a matter of scheduling in the compositor. The latency 
>>> incurred by the compositor shouldn't have to be more than single-digit 
>>> milliseconds. (I've seen total latency from when the client starts 
>>> processing a (static) frame to when it starts being scanned out as low as 
>>> ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
>>> lower than typical with Xorg)
>>
>> Well let me describe it like this:
>>
>> We have a use case for 144 Hz guaranteed refresh rate. That essentially 
>> means that the client application needs to be able to spit out one 
>> frame/window content every ~6.9ms. That's tough, but doable.
>>
>> When you now add 6ms latency in the compositor that means the client 
>> application has only 0.9ms left for its frame, which is basically impossible 
>> to do.
> 
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
> client to scanout, but the client can start as early as it wants/needs to. 
> It's a trade-off between latency and the risk of missing a scanout cycle.

Note that what I wrote above is about the case where the compositor needs to 
draw its own frame, sampling from the client buffer. If your concern is about a 
fullscreen application for which the compositor can directly use the 
application buffers for scanout, it should be possible in theory to get the 
latency incurred by the compositor down to ~1 ms.

If that's too much[0], it could be improved further by adding an atomic KMS API 
to replace a pending page flip with another one. Then the compositor could just 
directly submit a flip as soon as a new buffer becomes ready (or even as soon 
as the client submits it to the compositor, depending on how exactly the new 
KMS API works). The minimum latency should then be mostly up to the kernel 
driver / HW.
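For context, the submission side of that would look roughly like the
sketch below; plane_id and the plane's "FB_ID" property ID are assumed
to have been looked up at startup. Today a second nonblocking commit on
the same CRTC while a flip is pending fails with EBUSY; allowing it to
replace the pending flip would be the new API:

  /* Submit a nonblocking atomic page flip as soon as a new buffer is
   * ready; the kernel delivers a page flip event on completion. */
  #include <errno.h>
  #include <stdint.h>
  #include <xf86drm.h>
  #include <xf86drmMode.h>

  static int submit_flip(int drm_fd, uint32_t plane_id,
                         uint32_t fb_prop_id, uint32_t new_fb_id)
  {
      drmModeAtomicReq *req = drmModeAtomicAlloc();
      int ret;

      if (!req)
          return -ENOMEM;

      drmModeAtomicAddProperty(req, plane_id, fb_prop_id, new_fb_id);
      ret = drmModeAtomicCommit(drm_fd, req,
                                DRM_MODE_ATOMIC_NONBLOCK |
                                DRM_MODE_PAGE_FLIP_EVENT, NULL);
      drmModeAtomicFree(req);
      return ret;
  }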

Another possibility would be for the application to use KMS directly, e.g. via 
a DRM lease. That might still require the same new API to get the flip 
submission latency significantly below 1 ms though.
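For reference, handing the output to the application is mostly a matter
of which KMS objects get leased (a sketch with illustrative object IDs):

  /* Lease a connector, CRTC and plane to a client. The returned DRM fd
   * only sees the leased objects, so the client can flip on them
   * directly without the compositor in the loop. */
  #include <stdint.h>
  #include <xf86drm.h>
  #include <xf86drmMode.h>

  static int lease_output(int master_fd, uint32_t connector_id,
                          uint32_t crtc_id, uint32_t plane_id,
                          uint32_t *lessee_id)
  {
      uint32_t objects[] = { connector_id, crtc_id, plane_id };

      /* Returns the lessee's fd, or a negative error code. */
      return drmModeCreateLease(master_fd, objects, 3, 0, lessee_id);
  }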


[0] Though I'm not sure how to reconcile that with "spitting out one frame 
every ~6.9ms is tough", as that means the theoretical minimum total 
client→scanout latency is ~7 ms (and missing a scanout cycle ~doubles the 
latency).

-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 3:01 p.m., Marek Olšák wrote:
> 
> 
> On Tue., Jun. 1, 2021, 08:51 Christian König 
> <ckoenig.leichtzumer...@gmail.com> wrote:
> 
> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> > wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
> >>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>>>> 3) Compositors (and other privileged processes, and display 
> >>>>>> flipping) can't trust imported/exported fences. They need a timeout 
> >>>>>> recovery mechanism from the beginning, and the following are some 
> >>>>>> possible solutions to timeouts:
> >>>>>>
> >>>>>> a) use a CPU wait with a small absolute timeout, and display the 
> >>>>>> previous content on timeout
> >>>>>> b) use a GPU wait with a small absolute timeout, and conditional 
> >>>>>> rendering will choose between the latest content (if signalled) and 
> >>>>>> previous content (if timed out)
> >>>>>>
> >>>>>> The result would be that the desktop can run close to 60 fps even 
> >>>>>> if an app runs at 1 fps.
> >>>>> FWIW, this is working with
> >>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
> >>>>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
> >>>>> provide the same dma-buf poll semantics as other drivers and high 
> >>>>> priority GFX contexts via EGL_IMG_context_priority which can preempt 
> >>>>> lower priority ones).
> >>>> Yeah, that is really nice to have.
> >>>>
> >>>> One question is if you wait on the CPU or the GPU for the new 
> >>>> surface to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
> >>>> The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there 
> >>> can be arbitrary other state which needs to be applied atomically together 
> >>> with the new buffer. (Though in theory, a compositor might get fancy and 
> >>> special-case surface commits which can be handled by waiting on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The 
> >>> latency incurred by the compositor shouldn't have to be more than 
> >>> single-digit milliseconds. (I've seen total latency from when the client 
> >>> starts processing a (static) frame to when it starts being scanned out as 
> >>> low as ~6 ms with 
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than 
> >>> typical with Xorg)
> >> Well let me describe it like this:
> >>
> >> We have a use case for 144 Hz guaranteed refresh rate. That essentially 
> >> means that the client application needs to be able to spit out one 
> >> frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms latency in the compositor that means the client 
> >> application has only 0.9ms left for its frame, which is basically 
> >> impossible to do.
> >>
> >> See, for the user fence handling the display engine will learn to read 
> >> sequence numbers from memory and decide on its own whether the old frame 
> >> or the new one is scanned out, to get the latency there as low as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.

EGLStreams are a red herring here. Wayland has atomic state transactions, 
similar to the atomic KMS API. These semantics could be achieved even with 
EGLStreams, at least via additional EGL extensions.

Any fancy technique we're discussing here would have to be completely between 
the Wayland compositor and the kernel, transparent to anything else.
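To illustrate the transaction semantics from the client side (setup
code omitted; a minimal sketch): everything set on a wl_surface is
merely pending state until the commit, at which point it all latches
at once.

  /* Buffer, damage and opaque region are all pending state;
   * wl_surface_commit() applies them atomically. This is why the
   * compositor can't let the display engine pick a buffer behind its
   * back: the rest of the committed state must match that buffer. */
  #include <wayland-client.h>

  static void commit_frame(struct wl_compositor *compositor,
                           struct wl_surface *surface,
                           struct wl_buffer *buffer,
                           int32_t width, int32_t height)
  {
      struct wl_region *opaque = wl_compositor_create_region(compositor);

      wl_region_add(opaque, 0, 0, width, height);
      wl_surface_attach(surface, buffer, 0, 0);
      wl_surface_damage(surface, 0, 0, width, height);
      wl_surface_set_opaque_region(surface, opaque);
      wl_surface_commit(surface); /* everything above applies at once */
      wl_region_destroy(opaque);
  }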


> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
> 
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
> 
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer that's already rendered),
> > but it needs explicit involvement of the compositor. At least without
> > adding eglstreams fd to the kernel and wiring up all the relevant
> > extensions.
> 
> 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 2:10 p.m., Christian König wrote:
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
 On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> 3) Compositors (and other privileged processes, and display flipping) 
> can't trust imported/exported fences. They need a timeout recovery 
> mechanism from the beginning, and the following are some possible 
> solutions to timeouts:
>
> a) use a CPU wait with a small absolute timeout, and display the previous 
> content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional 
> rendering will choose between the latest content (if signalled) and 
> previous content (if timed out)
>
> The result would be that the desktop can run close to 60 fps even if an 
> app runs at 1 fps.
 FWIW, this is working with
 https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
 implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
 provide the same dma-buf poll semantics as other drivers and high priority 
 GFX contexts via EGL_IMG_context_priority which can preempt lower priority 
 ones).
>>> Yeah, that is really nice to have.
>>>
>>> One question is if you wait on the CPU or the GPU for the new surface to 
>>> become available?
>> It's based on polling dma-buf fds, i.e. CPU.
>>
>>> The former is a bit bad for latency and power management.
>> There isn't a choice for Wayland compositors in general, since there can be 
>> arbitrary other state which needs to be applied atomically together with the 
>> new buffer. (Though in theory, a compositor might get fancy and special-case 
>> surface commits which can be handled by waiting on the GPU)
>>
>> Latency is largely a matter of scheduling in the compositor. The latency 
>> incurred by the compositor shouldn't have to be more than single-digit 
>> milliseconds. (I've seen total latency from when the client starts 
>> processing a (static) frame to when it starts being scanned out as low as ~6 
>> ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower 
>> than typical with Xorg)
> 
> Well let me describe it like this:
> 
> We have an use cases for 144 Hz guaranteed refresh rate. That essentially 
> means that the client application needs to be able to spit out one 
> frame/window content every ~6.9ms. That's tough, but doable.
> 
> When you now add 6ms latency in the compositor that means the client 
> application has only .9ms left for it's frame which is basically impossible 
> to do.

You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
client to scanout, but the client can start as early as it wants/needs to. It's 
a trade-off between latency and the risk of missing a scanout cycle.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Marek Olšák
On Tue., Jun. 1, 2021, 08:51 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> >  wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
> >>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>>>> 3) Compositors (and other privileged processes, and display
> >>>>>> flipping) can't trust imported/exported fences. They need a timeout
> >>>>>> recovery mechanism from the beginning, and the following are some
> >>>>>> possible solutions to timeouts:
> >>>>>>
> >>>>>> a) use a CPU wait with a small absolute timeout, and display the
> >>>>>> previous content on timeout
> >>>>>> b) use a GPU wait with a small absolute timeout, and conditional
> >>>>>> rendering will choose between the latest content (if signalled) and
> >>>>>> previous content (if timed out)
> >>>>>>
> >>>>>> The result would be that the desktop can run close to 60 fps even
> >>>>>> if an app runs at 1 fps.
> >>>>> FWIW, this is working with
> >>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even
> >>>>> with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to
> >>>>> provide the same dma-buf poll semantics as other drivers and high priority
> >>>>> GFX contexts via EGL_IMG_context_priority which can preempt lower priority
> >>>>> ones).
> >>>> Yeah, that is really nice to have.
> >>>>
> >>>> One question is if you wait on the CPU or the GPU for the new surface
> >>>> to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
> >>>> The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there
> >>> can be arbitrary other state which needs to be applied atomically together
> >>> with the new buffer. (Though in theory, a compositor might get fancy and
> >>> special-case surface commits which can be handled by waiting on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The
> >>> latency incurred by the compositor shouldn't have to be more than
> >>> single-digit milliseconds. (I've seen total latency from when the client
> >>> starts processing a (static) frame to when it starts being scanned out as
> >>> low as ~6 ms with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> >>> typical with Xorg)
> >> Well let me describe it like this:
> >>
> >> We have a use case for 144 Hz guaranteed refresh rate. That
> >> essentially means that the client application needs to be able to spit
> >> out one frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms latency in the compositor that means the client
> >> application has only 0.9ms left for its frame, which is basically
> >> impossible to do.
> >>
> >> See, for the user fence handling the display engine will learn to read
> >> sequence numbers from memory and decide on its own whether the old frame or
> >> the new one is scanned out, to get the latency there as low as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like: ok, we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.
>
> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
>
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
>
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer that's already rendered),
> > but it needs explicit involvement of the compositor. At least without
> > adding eglstreams fd to the kernel and wiring up all the relevant
> > extensions.
>
> Question is do we already have some extension which allows different
> textures to be selected on the fly depending on some state?
>

There is no such extension for sync objects, but it can be done with
queries, like occlusion queries. There is also no timeout option, and it
can only do "if" and "if not", but not "if .. else".

Marek



> E.g. something like use new frame if it's available and old frame
> otherwise.
>
> Whether you then apply this to the standard dma_fence based hardware or
> the new user fence based one is then pretty much irrelevant.
>
> Regards,
> Christian.
>
> > -Daniel
> >

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> On Tue, Jun 1, 2021 at 2:10 PM Christian König
> wrote:
>> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>>>> 3) Compositors (and other privileged processes, and display flipping) can't
>>>>>> trust imported/exported fences. They need a timeout recovery mechanism from
>>>>>> the beginning, and the following are some possible solutions to timeouts:
>>>>>>
>>>>>> a) use a CPU wait with a small absolute timeout, and display the previous
>>>>>> content on timeout
>>>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering
>>>>>> will choose between the latest content (if signalled) and previous content
>>>>>> (if timed out)
>>>>>>
>>>>>> The result would be that the desktop can run close to 60 fps even if an app
>>>>>> runs at 1 fps.
>>>>> FWIW, this is working with
>>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with
>>>>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide
>>>>> the same dma-buf poll semantics as other drivers and high priority GFX
>>>>> contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>>>> Yeah, that is really nice to have.
>>>>
>>>> One question is if you wait on the CPU or the GPU for the new surface to
>>>> become available?
>>> It's based on polling dma-buf fds, i.e. CPU.
>>>
>>>> The former is a bit bad for latency and power management.
>>> There isn't a choice for Wayland compositors in general, since there can be
>>> arbitrary other state which needs to be applied atomically together with the
>>> new buffer. (Though in theory, a compositor might get fancy and special-case
>>> surface commits which can be handled by waiting on the GPU)
>>>
>>> Latency is largely a matter of scheduling in the compositor. The latency
>>> incurred by the compositor shouldn't have to be more than single-digit
>>> milliseconds. (I've seen total latency from when the client starts processing
>>> a (static) frame to when it starts being scanned out as low as ~6 ms with
>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
>>> typical with Xorg)
>> Well let me describe it like this:
>>
>> We have a use case for 144 Hz guaranteed refresh rate. That
>> essentially means that the client application needs to be able to spit
>> out one frame/window content every ~6.9ms. That's tough, but doable.
>>
>> When you now add 6ms latency in the compositor that means the client
>> application has only 0.9ms left for its frame, which is basically
>> impossible to do.
>>
>> See, for the user fence handling the display engine will learn to read
>> sequence numbers from memory and decide on its own whether the old frame or
>> the new one is scanned out, to get the latency there as low as possible.
> This won't work with implicit sync at all.
>
> If you want to enable this use-case with driver magic and without the
> compositor being aware of what's going on, the solution is EGLStreams.
> Not sure we want to go there, but it's definitely a lot more feasible
> than trying to stuff eglstreams semantics into dma-buf implicit
> fencing support in a desperate attempt to not change compositors.


Well not changing compositors is certainly not something I would try 
with this use case.


Not changing compositors is more like: ok, we have Ubuntu 20.04 and need 
to support that with the newest hardware generation.



> I still think the most reasonable approach here is that we wrap a
> dma_fence compat layer/mode over new hw for existing
> userspace/compositors. And then enable userspace memory fences and the
> fancy new features those allow with a new model that's built for them.


Yeah, that's basically the same direction I'm heading. Question is how 
to fix all those details.



> Also even with dma_fence we could implement your model of staying with
> the previous buffer (or an older buffer that's already rendered),
> but it needs explicit involvement of the compositor. At least without
> adding eglstreams fd to the kernel and wiring up all the relevant
> extensions.


Question is do we already have some extension which allows different 
textures to be selected on the fly depending on some state?


E.g. something like use new frame if it's available and old frame otherwise.

Whether you then apply this to the standard dma_fence based hardware or 
the new user fence based one is then pretty much irrelevant.


Regards,
Christian.


> -Daniel


>>>> Another question is if that is sufficient as security for the display server
>>>> or if we need further handling down the road? I mean essentially we are moving
>>>> the reliability problem into the display server.
>>> Good question. This should generally protect the display server from freezing
>>> due to client fences never signalling, but there might still be corner cases.








Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Daniel Vetter
On Tue, Jun 1, 2021 at 2:10 PM Christian König
 wrote:
>
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> >> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>> 3) Compositors (and other privileged processes, and display flipping) 
> >>>> can't trust imported/exported fences. They need a timeout recovery 
> >>>> mechanism from the beginning, and the following are some possible 
> >>>> solutions to timeouts:
> >>>>
> >>>> a) use a CPU wait with a small absolute timeout, and display the 
> >>>> previous content on timeout
> >>>> b) use a GPU wait with a small absolute timeout, and conditional 
> >>>> rendering will choose between the latest content (if signalled) and 
> >>>> previous content (if timed out)
> >>>>
> >>>> The result would be that the desktop can run close to 60 fps even if an 
> >>>> app runs at 1 fps.
> >>> FWIW, this is working with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
> >>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
> >>> provide the same dma-buf poll semantics as other drivers and high 
> >>> priority GFX contexts via EGL_IMG_context_priority which can preempt 
> >>> lower priority ones).
> >> Yeah, that is really nice to have.
> >>
> >> One question is if you wait on the CPU or the GPU for the new surface to 
> >> become available?
> > It's based on polling dma-buf fds, i.e. CPU.
> >
> >> The former is a bit bad for latency and power management.
> > There isn't a choice for Wayland compositors in general, since there can be 
> > arbitrary other state which needs to be applied atomically together with 
> > the new buffer. (Though in theory, a compositor might get fancy and 
> > special-case surface commits which can be handled by waiting on the GPU)
> >
> > Latency is largely a matter of scheduling in the compositor. The latency 
> > incurred by the compositor shouldn't have to be more than single-digit 
> > milliseconds. (I've seen total latency from when the client starts 
> > processing a (static) frame to when it starts being scanned out as low as 
> > ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
> > lower than typical with Xorg)
>
> Well let me describe it like this:
>
> We have a use case for 144 Hz guaranteed refresh rate. That
> essentially means that the client application needs to be able to spit
> out one frame/window content every ~6.9ms. That's tough, but doable.
>
> When you now add 6ms latency in the compositor that means the client
> application has only 0.9ms left for its frame, which is basically
> impossible to do.
>
> See, for the user fence handling the display engine will learn to read
> sequence numbers from memory and decide on its own whether the old frame
> or the new one is scanned out, to get the latency there as low as possible.

This won't work with implicit sync at all.

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.
Also even with dma_fence we could implement your model of staying with
the previous buffer (or an older buffer that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding eglstreams fd to the kernel and wiring up all the relevant
extensions.
-Daniel

> >> Another question is if that is sufficient as security for the display 
> >> server or if we need further handling down the road? I mean essentially we 
> >> are moving the reliability problem into the display server.
> > Good question. This should generally protect the display server from 
> > freezing due to client fences never signalling, but there might still be 
> > corner cases.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 12:49 schrieb Michel Dänzer:

> On 2021-06-01 12:21 p.m., Christian König wrote:
>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>> 3) Compositors (and other privileged processes, and display flipping) can't
>>>> trust imported/exported fences. They need a timeout recovery mechanism from
>>>> the beginning, and the following are some possible solutions to timeouts:
>>>>
>>>> a) use a CPU wait with a small absolute timeout, and display the previous
>>>> content on timeout
>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering
>>>> will choose between the latest content (if signalled) and previous content
>>>> (if timed out)
>>>>
>>>> The result would be that the desktop can run close to 60 fps even if an app
>>>> runs at 1 fps.
>>> FWIW, this is working with
>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with
>>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide
>>> the same dma-buf poll semantics as other drivers and high priority GFX
>>> contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>> Yeah, that is really nice to have.
>>
>> One question is if you wait on the CPU or the GPU for the new surface to
>> become available?
> It's based on polling dma-buf fds, i.e. CPU.
>
>> The former is a bit bad for latency and power management.
> There isn't a choice for Wayland compositors in general, since there can be
> arbitrary other state which needs to be applied atomically together with the
> new buffer. (Though in theory, a compositor might get fancy and special-case
> surface commits which can be handled by waiting on the GPU)
>
> Latency is largely a matter of scheduling in the compositor. The latency
> incurred by the compositor shouldn't have to be more than single-digit
> milliseconds. (I've seen total latency from when the client starts processing
> a (static) frame to when it starts being scanned out as low as ~6 ms with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> typical with Xorg)


Well let me describe it like this:

We have a use case for 144 Hz guaranteed refresh rate. That 
essentially means that the client application needs to be able to spit 
out one frame/window content every ~6.9ms. That's tough, but doable.

When you now add 6ms latency in the compositor that means the client 
application has only 0.9ms left for its frame, which is basically 
impossible to do.

See, for the user fence handling the display engine will learn to read 
sequence numbers from memory and decide on its own whether the old frame 
or the new one is scanned out, to get the latency there as low as possible.



>> Another question is if that is sufficient as security for the display server
>> or if we need further handling down the road? I mean essentially we are moving
>> the reliability problem into the display server.
> Good question. This should generally protect the display server from freezing
> due to client fences never signalling, but there might still be corner cases.






Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 12:21 p.m., Christian König wrote:
> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>> 3) Compositors (and other privileged processes, and display flipping) can't 
>>> trust imported/exported fences. They need a timeout recovery mechanism from 
>>> the beginning, and the following are some possible solutions to timeouts:
>>>
>>> a) use a CPU wait with a small absolute timeout, and display the previous 
>>> content on timeout
>>> b) use a GPU wait with a small absolute timeout, and conditional rendering 
>>> will choose between the latest content (if signalled) and previous content 
>>> (if timed out)
>>>
>>> The result would be that the desktop can run close to 60 fps even if an app 
>>> runs at 1 fps.
>> FWIW, this is working with
>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide 
>> the same dma-buf poll semantics as other drivers and high priority GFX 
>> contexts via EGL_IMG_context_priority which can preempt lower priority ones).
> 
> Yeah, that is really nice to have.
> 
> One question is if you wait on the CPU or the GPU for the new surface to 
> become available?

It's based on polling dma-buf fds, i.e. CPU.

> The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)
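Concretely, the polling amounts to something like this; in a real
compositor the fd goes into the event loop rather than a blocking
poll(), and the timeout is illustrative:

  /* POLLIN on a dma-buf fd waits on the CPU for the buffer's write
   * fences, i.e. for the client's rendering to finish. */
  #include <poll.h>

  static int wait_buffer_idle(int dmabuf_fd, int timeout_ms)
  {
      struct pollfd pfd = {
          .fd = dmabuf_fd,
          .events = POLLIN,
      };

      /* > 0: fences signalled; 0: timed out (keep showing the
       * previous buffer); < 0: error, see errno. */
      return poll(&pfd, 1, timeout_ms);
  }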


> Another question is if that is sufficient as security for the display server 
> or if we need further handling down the road? I mean essentially we are 
> moving the reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>> 3) Compositors (and other privileged processes, and display flipping) can't
>> trust imported/exported fences. They need a timeout recovery mechanism from
>> the beginning, and the following are some possible solutions to timeouts:
>>
>> a) use a CPU wait with a small absolute timeout, and display the previous
>> content on timeout
>> b) use a GPU wait with a small absolute timeout, and conditional rendering
>> will choose between the latest content (if signalled) and previous content
>> (if timed out)
>>
>> The result would be that the desktop can run close to 60 fps even if an app
>> runs at 1 fps.
> FWIW, this is working with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with
> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide
> the same dma-buf poll semantics as other drivers and high priority GFX
> contexts via EGL_IMG_context_priority which can preempt lower priority ones).


Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to 
become available? The former is a bit bad for latency and power management.


Another question is if that is sufficient as security for the display 
server or if we need further handling down the road? I mean essentially 
we are moving the reliability problem into the display server.


Regards,
Christian.


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> 
> 3) Compositors (and other privileged processes, and display flipping) can't 
> trust imported/exported fences. They need a timeout recovery mechanism from 
> the beginning, and the following are some possible solutions to timeouts:
> 
> a) use a CPU wait with a small absolute timeout, and display the previous 
> content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional rendering 
> will choose between the latest content (if signalled) and previous content 
> (if timed out)
> 
> The result would be that the desktop can run close to 60 fps even if an app 
> runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).
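For reference, the high priority context side is just a context
attribute; a minimal sketch (real code must check for the
EGL_IMG_context_priority extension string first, and dpy/config setup
is assumed):

  /* Create the compositor's context as high priority so its work can
   * preempt long-running client submissions, where the driver
   * supports that. */
  #include <EGL/egl.h>
  #include <EGL/eglext.h>

  static EGLContext create_high_priority_context(EGLDisplay dpy,
                                                 EGLConfig config)
  {
      static const EGLint attribs[] = {
          EGL_CONTEXT_MAJOR_VERSION, 3,
          EGL_CONTEXT_PRIORITY_LEVEL_IMG, EGL_CONTEXT_PRIORITY_HIGH_IMG,
          EGL_NONE,
      };

      return eglCreateContext(dpy, config, EGL_NO_CONTEXT, attribs);
  }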


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer