Re: [PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent

Riana Tauro Thu, 31 Jul 2025 06:49:11 -0700



On 7/31/2025 6:31 PM, Maxime Ripard wrote:

On Thu, Jul 31, 2025 at 04:43:46PM +0530, Riana Tauro wrote:

Hi Maxim

On 7/31/2025 3:02 PM, Maxime Ripard wrote:

Hi,

On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote:

On 7/28/2025 3:57 PM, Riana Tauro wrote:

Address the need for a recovery method (firmware flash on Firmware errors)
introduced in the later patches of Xe KMD.
Whenever XE KMD detects a firmware error, a firmware flash is required to
recover the device to normal operation.

The initial proposal to use 'firmware-flash' as a recovery method was
not applicable to other drivers and could cause multiple recovery
methods specific to vendors to be added.
To address this a more generic 'vendor-specific' method is introduced,
guiding users to refer to vendor specific documentation and system logs
for detailed vendor specific recovery procedure.

Add a recovery method 'WEDGED=vendor-specific' for such errors.
Vendors must provide additional recovery documentation if this method
is used.

It is the responsibility of the consumer to refer to the correct vendor
specific documentation and usecase before attempting a recovery.

For example: If driver is XE KMD, the consumer must refer
to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.

Recovery script contributed by Raag.

v2: fix documentation (Raag)
v3: add more details to commit message (Sima, Rodrigo, Raag)
       add an example script to the documentation (Raag)
v4: use consistent naming (Raag)
v5: fix commit message

Cc: André Almeida <andrealm...@igalia.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: David Airlie <airl...@gmail.com>
Cc: Simona Vetter <simona.vet...@ffwll.ch>
Co-developed-by: Raag Jadav <raag.ja...@intel.com>
Signed-off-by: Raag Jadav <raag.ja...@intel.com>
Signed-off-by: Riana Tauro <riana.ta...@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com>


This patch needs an ack from drm to be merged.
The rest of the series have RB's. Can someone please provide an ack ?

Cc: drm-misc maintainers

Thanks
Riana

---
    Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
    drivers/gpu/drm/drm_drv.c      |  2 ++
    include/drm/drm_device.h       |  4 ++++
    3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 843facf01b2d..5691b29acde3 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -418,13 +418,15 @@ needed.
    Recovery
    --------
-Current implementation defines three recovery methods, out of which, drivers
+Current implementation defines four recovery methods, out of which, drivers
    can use any one, multiple or none. Method(s) of choice will be sent in the
    uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less 
to
-more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+more side-effects. If recovery method is specific to vendor
+``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
+specific documentation for the recovery procedure. As an example if the driver
+is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to be
+referred for the recovery procedure. If driver is unsure about recovery or
+method is unknown, ``WEDGED=unknown`` will be sent instead.
    Userspace consumers can parse this event and attempt recovery as per the
    following expectations.
@@ -435,6 +437,7 @@ following expectations.
        none            optional telemetry collection
        rebind          unbind + bind driver
        bus-reset       unbind + bus reset/re-enumeration + bind
+    vendor-specific vendor specific recovery method
        unknown         consumer policy
        =============== ========================================
@@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and 
file descriptors should
    be closed to prevent leaks or undefined behaviour. The idea here is to 
clear the
    device of all user context beforehand and set the stage for a clean 
recovery.
-Example
--------
+For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
+consumer to check the driver documentation and the usecase before attempting
+a recovery.
+
+Example - rebind
+----------------
    Udev rule::
@@ -491,6 +498,27 @@ Recovery script::
        echo -n $DEVICE > $DRIVER/unbind
        echo -n $DEVICE > $DRIVER/bind
+Example - vendor-specific
+-------------------------
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", 
DEVPATH=="*/drm/card[0-9]",
+    RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
+
+Recovery script::
+
+    #!/bin/sh
+
+    DEVPATH=$(readlink -f /sys/$1/device)
+    DRIVERPATH=$(readlink -f $DEVPATH/driver)
+    DRIVER=$(basename $DRIVERPATH)
+
+    if [ "$DRIVER" = "xe" ]; then
+        # Refer XE documentation and check usecase and recovery procedure
+    fi
+
+


So I guess I'm not opposed to it on principle, but the documentation
really needs some work.

You should at least list the valid vendor specific options, and what
each mean exactly. Ideally, it should be a link to the datasheet/manual
detailing the recovery procedure,


This is added above

"If recovery method is specific to vendor ``WEDGED=vendor-specific`` will be
sent and userspace should refer to vendor specific documentation for the
recovery procedure. As an example if the driver is 'Xe' then the
documentation for 'Device Wedging' of Xe driver needs to be referred for the
recovery procedure."

The documentation of Xe is in Patch 6

https://lore.kernel.org/intel-xe/20250728102809.502324-7-riana.ta...@intel.com/


I'm sorry, I still don't get how, as a user, I can reimplement what that
tool is supposed to be doing. Or do you anticipate that there's only
ever be a single way to recover a Xe device, which is to reflash the
firmware?

As suggested, it is a combination of uevent with an additional vendorspecific indicator. In this case, along with the uevent there is a sysfsand dmesg that indicates firmware flash is the required

In the future, if there is another vendor-specific recovery needed thensame will be indicated in dmesg/sysfs or any vendor specific way and thecase will be documented


What if in ~5y, Intel comes up with a new recovery method for the newer
models?

That will be documented in the Xe Device Wedging documentation asmentioned above

I'll add the link instead of just the chapter name

but if that's under NDA, at least a
reference to the document and section you need to look at to implement
it properly.

Or if that's still not doable, anything that tells you what to do
instead of "run a shell script we don't provide".

Also, we just discussed it with Sima on IRC, and she mentioned that we
probably want to have a vendor specific prefix for each vendor-specific
method.


This was discussed as part of Rev4

https://lore.kernel.org/intel-xe/ag-u9jtxdah_t...@black.fi.intel.com/

DEVPATH from uevent and driver should be able to identify the driver.
Shouldn't that be enough?


See above. What happens if we start to see systems with two Xe GPUs, one
with a new recovery method and one with an old recovery method?

The DEV_PATH will indicate the device and the vendor-specific indicatorlike dmesg/sysfs will indicate the recovery procedure


Thanks
Riana


Maxime

Re: [PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent

Reply via email to