Hi, On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote: > On 7/28/2025 3:57 PM, Riana Tauro wrote: > > Address the need for a recovery method (firmware flash on Firmware errors) > > introduced in the later patches of Xe KMD. > > Whenever XE KMD detects a firmware error, a firmware flash is required to > > recover the device to normal operation. > > > > The initial proposal to use 'firmware-flash' as a recovery method was > > not applicable to other drivers and could cause multiple recovery > > methods specific to vendors to be added. > > To address this a more generic 'vendor-specific' method is introduced, > > guiding users to refer to vendor specific documentation and system logs > > for detailed vendor specific recovery procedure. > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. > > Vendors must provide additional recovery documentation if this method > > is used. > > > > It is the responsibility of the consumer to refer to the correct vendor > > specific documentation and usecase before attempting a recovery. > > > > For example: If driver is XE KMD, the consumer must refer > > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'. > > > > Recovery script contributed by Raag. > > > > v2: fix documentation (Raag) > > v3: add more details to commit message (Sima, Rodrigo, Raag) > > add an example script to the documentation (Raag) > > v4: use consistent naming (Raag) > > v5: fix commit message > > > > Cc: André Almeida <andrealm...@igalia.com> > > Cc: Christian König <christian.koe...@amd.com> > > Cc: David Airlie <airl...@gmail.com> > > Cc: Simona Vetter <simona.vet...@ffwll.ch> > > Co-developed-by: Raag Jadav <raag.ja...@intel.com> > > Signed-off-by: Raag Jadav <raag.ja...@intel.com> > > Signed-off-by: Riana Tauro <riana.ta...@intel.com> > > Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com> > > This patch needs an ack from drm to be merged. > The rest of the series have RB's. Can someone please provide an ack ? > > Cc: drm-misc maintainers > > Thanks > Riana > > > --- > > Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------ > > drivers/gpu/drm/drm_drv.c | 2 ++ > > include/drm/drm_device.h | 4 ++++ > > 3 files changed, 41 insertions(+), 7 deletions(-) > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > index 843facf01b2d..5691b29acde3 100644 > > --- a/Documentation/gpu/drm-uapi.rst > > +++ b/Documentation/gpu/drm-uapi.rst > > @@ -418,13 +418,15 @@ needed. > > Recovery > > -------- > > -Current implementation defines three recovery methods, out of which, > > drivers > > +Current implementation defines four recovery methods, out of which, drivers > > can use any one, multiple or none. Method(s) of choice will be sent in the > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of > > less to > > -more side-effects. If driver is unsure about recovery or method is unknown > > -(like soft/hard system reboot, firmware flashing, physical device > > replacement > > -or any other procedure which can't be attempted on the fly), > > ``WEDGED=unknown`` > > -will be sent instead. > > +more side-effects. If recovery method is specific to vendor > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to > > vendor > > +specific documentation for the recovery procedure. As an example if the > > driver > > +is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to > > be > > +referred for the recovery procedure. If driver is unsure about recovery or > > +method is unknown, ``WEDGED=unknown`` will be sent instead. > > Userspace consumers can parse this event and attempt recovery as per the > > following expectations. > > @@ -435,6 +437,7 @@ following expectations. > > none optional telemetry collection > > rebind unbind + bind driver > > bus-reset unbind + bus reset/re-enumeration + bind > > + vendor-specific vendor specific recovery method > > unknown consumer policy > > =============== ======================================== > > @@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and > > file descriptors should > > be closed to prevent leaks or undefined behaviour. The idea here is to > > clear the > > device of all user context beforehand and set the stage for a clean > > recovery. > > -Example > > -------- > > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility > > of the > > +consumer to check the driver documentation and the usecase before > > attempting > > +a recovery. > > + > > +Example - rebind > > +---------------- > > Udev rule:: > > @@ -491,6 +498,27 @@ Recovery script:: > > echo -n $DEVICE > $DRIVER/unbind > > echo -n $DEVICE > $DRIVER/bind > > +Example - vendor-specific > > +------------------------- > > + > > +Udev rule:: > > + > > + SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", > > DEVPATH=="*/drm/card[0-9]", > > + RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}" > > + > > +Recovery script:: > > + > > + #!/bin/sh > > + > > + DEVPATH=$(readlink -f /sys/$1/device) > > + DRIVERPATH=$(readlink -f $DEVPATH/driver) > > + DRIVER=$(basename $DRIVERPATH) > > + > > + if [ "$DRIVER" = "xe" ]; then > > + # Refer XE documentation and check usecase and recovery procedure > > + fi > > + > > +
So I guess I'm not opposed to it on principle, but the documentation really needs some work. You should at least list the valid vendor specific options, and what each mean exactly. Ideally, it should be a link to the datasheet/manual detailing the recovery procedure, but if that's under NDA, at least a reference to the document and section you need to look at to implement it properly. Or if that's still not doable, anything that tells you what to do instead of "run a shell script we don't provide". Also, we just discussed it with Sima on IRC, and she mentioned that we probably want to have a vendor specific prefix for each vendor-specific method. Maxime
signature.asc
Description: PGP signature