On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <dan...@ffwll.ch> wrote:
>
> On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paala...@collabora.com>
> > Signed-off-by: André Almeida <andrealm...@igalia.com>
> > ---
> >
> > v4: 
> > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third 
> > handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to 
> > perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. 
> > KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the 
> > rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the 
> > device has
> > +been reset, and this can be checked more often if the UMD requires it. 
> > After
> > +detecting a reset, UMD will then proceed to report it to the application 
> > using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. 
> > However,
> > +there is no guarantee that the app will use such features correctly, and 
> > the
> > +UMD can implement policies to close the app if it is a repeating offender,
>
> Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> offender killing is more a policy where the kernel enforces policy, and no
> longer up to userspace to dtrt (because very clearly userspace is not
> really doing the right thing anymore when it's just hanging the gpu in an
> endless loop). Also maybe tune it down further to something like "the
> kernel driver may implemnent ..."
>
> In my opinion the umd shouldn't implement these kind of magic guesses, the
> entire point of robustness apis is to delegate responsibility for
> correctly recovering to the application. And the kernel is left with
> enforcing fair resource usage policies (which eventually might be a
> cgroups limit on how much gpu time you're allowed to waste with gpu
> resets).

Killing apps that the kernel thinks are misbehaving really doesn't
seem like a good idea to me. What if the process is a service getting
restarted after getting killed? What if killing that process leaves
the system in a bad state?

Can't the kernel provide some information to user space so that e.g.
systemd can handle those situations?

> > +likely in a broken loop. This is done to ensure that it does not keep 
> > blocking
> > +the user interface from being correctly displayed. This should be done 
> > even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). 
> > This
> > +interface tells if a reset has happened, and if so, all the context state 
> > is
> > +considered lost and the app proceeds by creating new ones. If it is 
> > possible to
> > +determine that robustness is not in use, the UMD will terminate the app 
> > when a
> > +reset is detected, giving that the contexts are lost and the app won't be 
> > able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for 
> > submissions.
> > +This error code means, among other things, that a device reset has 
> > happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, 
> > it's
> > +really useful for driver developers to learn more about what caused the 
> > reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
>
> Since we do not seem to have a solid consensus in the community about
> non-robust userspace, maybe we could just document that lack of consensus
> to unblock this patch? Something like this:
>
> Non-Robust Userspace
> --------------------
>
> Userspace that doesn't support robust interfaces (like an non-robust
> OpenGL context or API without any robustness support like libva) leave the
> robustness handling entirely to the userspace driver. There is no strong
> community consensus on what the userspace driver should do in that case,
> since all reasonable approaches have some clear downsides.
>
> With the s/UMD/KMD/ further up and maybe something added to record the
> non-robustness non-consensus:
>
> Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>
>
> Cheers, Daniel
>
>
>
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

Reply via email to