On Thu, Jul 31, 2025 at 03:01:18PM +0200, Maxime Ripard wrote: > On Thu, Jul 31, 2025 at 04:43:46PM +0530, Riana Tauro wrote: > > Hi Maxim > > > > On 7/31/2025 3:02 PM, Maxime Ripard wrote: > > > Hi, > > > > > > On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote: > > > > On 7/28/2025 3:57 PM, Riana Tauro wrote: > > > > > Address the need for a recovery method (firmware flash on Firmware > > > > > errors) > > > > > introduced in the later patches of Xe KMD. > > > > > Whenever XE KMD detects a firmware error, a firmware flash is > > > > > required to > > > > > recover the device to normal operation. > > > > > > > > > > The initial proposal to use 'firmware-flash' as a recovery method was > > > > > not applicable to other drivers and could cause multiple recovery > > > > > methods specific to vendors to be added. > > > > > To address this a more generic 'vendor-specific' method is introduced, > > > > > guiding users to refer to vendor specific documentation and system > > > > > logs > > > > > for detailed vendor specific recovery procedure. > > > > > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. > > > > > Vendors must provide additional recovery documentation if this method > > > > > is used. > > > > > > > > > > It is the responsibility of the consumer to refer to the correct > > > > > vendor > > > > > specific documentation and usecase before attempting a recovery. > > > > > > > > > > For example: If driver is XE KMD, the consumer must refer > > > > > to the documentation of 'Device Wedging' under > > > > > 'Documentation/gpu/xe/'. > > > > > > > > > > Recovery script contributed by Raag. > > > > > > > > > > v2: fix documentation (Raag) > > > > > v3: add more details to commit message (Sima, Rodrigo, Raag) > > > > > add an example script to the documentation (Raag) > > > > > v4: use consistent naming (Raag) > > > > > v5: fix commit message > > > > > > > > > > Cc: André Almeida <andrealm...@igalia.com> > > > > > Cc: Christian König <christian.koe...@amd.com> > > > > > Cc: David Airlie <airl...@gmail.com> > > > > > Cc: Simona Vetter <simona.vet...@ffwll.ch> > > > > > Co-developed-by: Raag Jadav <raag.ja...@intel.com> > > > > > Signed-off-by: Raag Jadav <raag.ja...@intel.com> > > > > > Signed-off-by: Riana Tauro <riana.ta...@intel.com> > > > > > Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com> > > > > > > > > This patch needs an ack from drm to be merged. > > > > The rest of the series have RB's. Can someone please provide an ack ? > > > > > > > > Cc: drm-misc maintainers > > > > > > > > Thanks > > > > Riana > > > > > > > > > --- > > > > > Documentation/gpu/drm-uapi.rst | 42 > > > > > ++++++++++++++++++++++++++++------ > > > > > drivers/gpu/drm/drm_drv.c | 2 ++ > > > > > include/drm/drm_device.h | 4 ++++ > > > > > 3 files changed, 41 insertions(+), 7 deletions(-) > > > > > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst > > > > > b/Documentation/gpu/drm-uapi.rst > > > > > index 843facf01b2d..5691b29acde3 100644 > > > > > --- a/Documentation/gpu/drm-uapi.rst > > > > > +++ b/Documentation/gpu/drm-uapi.rst > > > > > @@ -418,13 +418,15 @@ needed. > > > > > Recovery > > > > > -------- > > > > > -Current implementation defines three recovery methods, out of which, > > > > > drivers > > > > > +Current implementation defines four recovery methods, out of which, > > > > > drivers > > > > > can use any one, multiple or none. Method(s) of choice will be > > > > > sent in the > > > > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order > > > > > of less to > > > > > -more side-effects. If driver is unsure about recovery or method is > > > > > unknown > > > > > -(like soft/hard system reboot, firmware flashing, physical device > > > > > replacement > > > > > -or any other procedure which can't be attempted on the fly), > > > > > ``WEDGED=unknown`` > > > > > -will be sent instead. > > > > > +more side-effects. If recovery method is specific to vendor > > > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer > > > > > to vendor > > > > > +specific documentation for the recovery procedure. As an example if > > > > > the driver > > > > > +is 'Xe' then the documentation for 'Device Wedging' of Xe driver > > > > > needs to be > > > > > +referred for the recovery procedure. If driver is unsure about > > > > > recovery or > > > > > +method is unknown, ``WEDGED=unknown`` will be sent instead. > > > > > Userspace consumers can parse this event and attempt recovery as > > > > > per the > > > > > following expectations. > > > > > @@ -435,6 +437,7 @@ following expectations. > > > > > none optional telemetry collection > > > > > rebind unbind + bind driver > > > > > bus-reset unbind + bus reset/re-enumeration + bind > > > > > + vendor-specific vendor specific recovery method > > > > > unknown consumer policy > > > > > =============== ======================================== > > > > > @@ -472,8 +475,12 @@ erroring out, all device memory should be > > > > > unmapped and file descriptors should > > > > > be closed to prevent leaks or undefined behaviour. The idea here > > > > > is to clear the > > > > > device of all user context beforehand and set the stage for a > > > > > clean recovery. > > > > > -Example > > > > > -------- > > > > > +For ``WEDGED=vendor-specific`` recovery method, it is the > > > > > responsibility of the > > > > > +consumer to check the driver documentation and the usecase before > > > > > attempting > > > > > +a recovery. > > > > > + > > > > > +Example - rebind > > > > > +---------------- > > > > > Udev rule:: > > > > > @@ -491,6 +498,27 @@ Recovery script:: > > > > > echo -n $DEVICE > $DRIVER/unbind > > > > > echo -n $DEVICE > $DRIVER/bind > > > > > +Example - vendor-specific > > > > > +------------------------- > > > > > + > > > > > +Udev rule:: > > > > > + > > > > > + SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", > > > > > DEVPATH=="*/drm/card[0-9]", > > > > > + RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}" > > > > > + > > > > > +Recovery script:: > > > > > + > > > > > + #!/bin/sh > > > > > + > > > > > + DEVPATH=$(readlink -f /sys/$1/device) > > > > > + DRIVERPATH=$(readlink -f $DEVPATH/driver) > > > > > + DRIVER=$(basename $DRIVERPATH) > > > > > + > > > > > + if [ "$DRIVER" = "xe" ]; then > > > > > + # Refer XE documentation and check usecase and recovery > > > > > procedure > > > > > + fi > > > > > + > > > > > + > > > > > > So I guess I'm not opposed to it on principle, but the documentation > > > really needs some work. > > > > > > You should at least list the valid vendor specific options, and what > > > each mean exactly. Ideally, it should be a link to the datasheet/manual > > > detailing the recovery procedure, > > > > This is added above > > > > "If recovery method is specific to vendor ``WEDGED=vendor-specific`` will be > > sent and userspace should refer to vendor specific documentation for the > > recovery procedure. As an example if the driver is 'Xe' then the > > documentation for 'Device Wedging' of Xe driver needs to be referred for the > > recovery procedure." > > > > The documentation of Xe is in Patch 6 > > > > https://lore.kernel.org/intel-xe/20250728102809.502324-7-riana.ta...@intel.com/ > > I'm sorry, I still don't get how, as a user, I can reimplement what that > tool is supposed to be doing. Or do you anticipate that there's only > ever be a single way to recover a Xe device, which is to reflash the > firmware?
Well, clearly the documentation here needs some adjustments since it is failing to explain everything that was previously discussed already: https://lore.kernel.org/dri-devel/aHDRyZBkw-Qa_30R@phenom.ffwll.local/ Since I was already familiar with the background and history, I found the documentation quite straightforward. However, I now believe it could be updated to make it clearer to anyone. > > What if in ~5y, Intel comes up with a new recovery method for the newer > models? the vendor-specific is by definition wedge-uevent[device-specific] + specific indication. Either by log or sysfs. So this design as is is flexible enough. As I stated in the previous discussions I was more in favor of the wedge-uevent[firmware-flash] but that was too specific, single-vendor and single-case. So the vendor-specific with further indication was the chosen one. > > > I'll add the link instead of just the chapter name > > > but if that's under NDA, at least a > > > reference to the document and section you need to look at to implement > > > it properly. > > > > > > Or if that's still not doable, anything that tells you what to do > > > instead of "run a shell script we don't provide". > > > > > > Also, we just discussed it with Sima on IRC, and she mentioned that we > > > probably want to have a vendor specific prefix for each vendor-specific > > > method. > > > > This was discussed as part of Rev4 > > > > https://lore.kernel.org/intel-xe/ag-u9jtxdah_t...@black.fi.intel.com/ > > > > DEVPATH from uevent and driver should be able to identify the driver. > > Shouldn't that be enough? > > See above. What happens if we start to see systems with two Xe GPUs, one > with a new recovery method and one with an old recovery method? The log or device's sysfs will have the correct indication for the correct device. Nothing to worry here. > > Maxime