On 7/31/2025 6:31 PM, Maxime Ripard wrote:
On Thu, Jul 31, 2025 at 04:43:46PM +0530, Riana Tauro wrote:
Hi Maxim
On 7/31/2025 3:02 PM, Maxime Ripard wrote:
Hi,
On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote:
On 7/28/2025 3:57 PM, Riana Tauro wrote:
Address the need for a recovery method (firmware flash on Firmware errors)
introduced in the later patches of Xe KMD.
Whenever XE KMD detects a firmware error, a firmware flash is required to
recover the device to normal operation.
The initial proposal to use 'firmware-flash' as a recovery method was
not applicable to other drivers and could cause multiple recovery
methods specific to vendors to be added.
To address this a more generic 'vendor-specific' method is introduced,
guiding users to refer to vendor specific documentation and system logs
for detailed vendor specific recovery procedure.
Add a recovery method 'WEDGED=vendor-specific' for such errors.
Vendors must provide additional recovery documentation if this method
is used.
It is the responsibility of the consumer to refer to the correct vendor
specific documentation and usecase before attempting a recovery.
For example: If driver is XE KMD, the consumer must refer
to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
Recovery script contributed by Raag.
v2: fix documentation (Raag)
v3: add more details to commit message (Sima, Rodrigo, Raag)
add an example script to the documentation (Raag)
v4: use consistent naming (Raag)
v5: fix commit message
Cc: André Almeida <andrealm...@igalia.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: David Airlie <airl...@gmail.com>
Cc: Simona Vetter <simona.vet...@ffwll.ch>
Co-developed-by: Raag Jadav <raag.ja...@intel.com>
Signed-off-by: Raag Jadav <raag.ja...@intel.com>
Signed-off-by: Riana Tauro <riana.ta...@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com>
This patch needs an ack from drm to be merged.
The rest of the series have RB's. Can someone please provide an ack ?
Cc: drm-misc maintainers
Thanks
Riana
---
Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
drivers/gpu/drm/drm_drv.c | 2 ++
include/drm/drm_device.h | 4 ++++
3 files changed, 41 insertions(+), 7 deletions(-)
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 843facf01b2d..5691b29acde3 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -418,13 +418,15 @@ needed.
Recovery
--------
-Current implementation defines three recovery methods, out of which, drivers
+Current implementation defines four recovery methods, out of which, drivers
can use any one, multiple or none. Method(s) of choice will be sent in the
uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less
to
-more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+more side-effects. If recovery method is specific to vendor
+``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
+specific documentation for the recovery procedure. As an example if the driver
+is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to be
+referred for the recovery procedure. If driver is unsure about recovery or
+method is unknown, ``WEDGED=unknown`` will be sent instead.
Userspace consumers can parse this event and attempt recovery as per the
following expectations.
@@ -435,6 +437,7 @@ following expectations.
none optional telemetry collection
rebind unbind + bind driver
bus-reset unbind + bus reset/re-enumeration + bind
+ vendor-specific vendor specific recovery method
unknown consumer policy
=============== ========================================
@@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and
file descriptors should
be closed to prevent leaks or undefined behaviour. The idea here is to
clear the
device of all user context beforehand and set the stage for a clean
recovery.
-Example
--------
+For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
+consumer to check the driver documentation and the usecase before attempting
+a recovery.
+
+Example - rebind
+----------------
Udev rule::
@@ -491,6 +498,27 @@ Recovery script::
echo -n $DEVICE > $DRIVER/unbind
echo -n $DEVICE > $DRIVER/bind
+Example - vendor-specific
+-------------------------
+
+Udev rule::
+
+ SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific",
DEVPATH=="*/drm/card[0-9]",
+ RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
+
+Recovery script::
+
+ #!/bin/sh
+
+ DEVPATH=$(readlink -f /sys/$1/device)
+ DRIVERPATH=$(readlink -f $DEVPATH/driver)
+ DRIVER=$(basename $DRIVERPATH)
+
+ if [ "$DRIVER" = "xe" ]; then
+ # Refer XE documentation and check usecase and recovery procedure
+ fi
+
+
So I guess I'm not opposed to it on principle, but the documentation
really needs some work.
You should at least list the valid vendor specific options, and what
each mean exactly. Ideally, it should be a link to the datasheet/manual
detailing the recovery procedure,
This is added above
"If recovery method is specific to vendor ``WEDGED=vendor-specific`` will be
sent and userspace should refer to vendor specific documentation for the
recovery procedure. As an example if the driver is 'Xe' then the
documentation for 'Device Wedging' of Xe driver needs to be referred for the
recovery procedure."
The documentation of Xe is in Patch 6
https://lore.kernel.org/intel-xe/20250728102809.502324-7-riana.ta...@intel.com/
I'm sorry, I still don't get how, as a user, I can reimplement what that
tool is supposed to be doing. Or do you anticipate that there's only
ever be a single way to recover a Xe device, which is to reflash the
firmware?
As suggested, it is a combination of uevent with an additional vendor
specific indicator. In this case, along with the uevent there is a sysfs
and dmesg that indicates firmware flash is the required
In the future, if there is another vendor-specific recovery needed then
same will be indicated in dmesg/sysfs or any vendor specific way and the
case will be documented
What if in ~5y, Intel comes up with a new recovery method for the newer
models?
That will be documented in the Xe Device Wedging documentation as
mentioned above
I'll add the link instead of just the chapter name
but if that's under NDA, at least a
reference to the document and section you need to look at to implement
it properly.
Or if that's still not doable, anything that tells you what to do
instead of "run a shell script we don't provide".
Also, we just discussed it with Sima on IRC, and she mentioned that we
probably want to have a vendor specific prefix for each vendor-specific
method.
This was discussed as part of Rev4
https://lore.kernel.org/intel-xe/ag-u9jtxdah_t...@black.fi.intel.com/
DEVPATH from uevent and driver should be able to identify the driver.
Shouldn't that be enough?
See above. What happens if we start to see systems with two Xe GPUs, one
with a new recovery method and one with an old recovery method?
The DEV_PATH will indicate the device and the vendor-specific indicator
like dmesg/sysfs will indicate the recovery procedure
Thanks
Riana
Maxime