On Thu, Jul 24, 2025 at 08:04:35PM +0530, Riana Tauro wrote:
> Add documentation for vendor specific device wedged recovery method
> and runtime survivability.
>
> v2: fix documentation (Raag)
> v3: add userspace tool for firmware update (Raag)
> v4: use consistent documentation (Raag)
>
> Signed-off-by: Riana Tauro <riana.ta...@intel.com>

Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com>

> ---
>  Documentation/gpu/xe/index.rst             |  1 +
>  Documentation/gpu/xe/xe_device.rst         | 10 +++++++
>  Documentation/gpu/xe/xe_pcode.rst          |  6 ++--
>  drivers/gpu/drm/xe/xe_device.c             | 22 ++++++++++++++
>  drivers/gpu/drm/xe/xe_survivability_mode.c | 35 +++++++++++++++++-----
>  5 files changed, 64 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/gpu/xe/xe_device.rst
>
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index 42ba6c263cd0..88b22fad880e 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -25,5 +25,6 @@ DG2, etc is provided to prototype the driver.
>     xe_tile
>     xe_debugging
>     xe_devcoredump
> +   xe_device
>     xe-drm-usage-stats.rst
>     xe_configfs
> diff --git a/Documentation/gpu/xe/xe_device.rst b/Documentation/gpu/xe/xe_device.rst
> new file mode 100644
> index 000000000000..39a937b97cd3
> --- /dev/null
> +++ b/Documentation/gpu/xe/xe_device.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +.. _xe-device-wedging:
> +
> +==================
> +Xe Device Wedging
> +==================
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/xe_device.c
> +   :doc: Xe Device Wedging
> diff --git a/Documentation/gpu/xe/xe_pcode.rst b/Documentation/gpu/xe/xe_pcode.rst
> index 5937ef3599b0..2a43601123cb 100644
> --- a/Documentation/gpu/xe/xe_pcode.rst
> +++ b/Documentation/gpu/xe/xe_pcode.rst
> @@ -13,9 +13,11 @@ Internal API
>  .. kernel-doc:: drivers/gpu/drm/xe/xe_pcode.c
>     :internal:
>
> +.. _xe-survivability-mode:
> +
>  ==================
> -Boot Survivability
> +Survivability Mode
>  ==================
>
>  .. kernel-doc:: drivers/gpu/drm/xe/xe_survivability_mode.c
> -   :doc: Xe Boot Survivability
> +   :doc: Survivability Mode
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index df29b87ffc5f..4a34b15f9527 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1157,6 +1157,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  }
>
>  /**
> + * DOC: Xe Device Wedging
> + *
> + * Xe driver uses drm device wedged uevent as documented in Documentation/gpu/drm-uapi.rst.
> + *
> + * When device is in wedged state, every IOCTL will be blocked and GT cannot be
> + * used. Certain critical errors like gt reset failure, firmware failures can cause
> + * the device to be wedged. The default recovery method for a wedged state
> + * is rebind/bus-reset.
> + *
> + * Another recovery method is vendor-specific. Below are the usecases that trigger
> + * vendor-specific drm device wedged uevent and the procedure to be performed
> + * to recover the device.
> + *
> + * Case 1: CSC firmware errors require a firmware flash to restore normal device
> + *         operation. Since firmware flash is a vendor-specific action
> + *         ``WEDGED=vendor-specific`` recovery method along with
> + *         :ref:`runtime survivability mode <xe-survivability-mode>` is used to
> + *         notify userspace. User can then initiate a firmware flash using userspace tools
> + *         like fwupd to restore device to normal situation.
> + */
> +
> +/*
>   * xe_device_set_wedged_method - Set wedged recovery method
>   * @xe: xe device instance
>   * @method: recovery method to set
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 267d0e3fd85a..86ba767c4e44 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -22,15 +22,18 @@
>  #define MAX_SCRATCH_MMIO 8
>
>  /**
> - * DOC: Xe Boot Survivability
> + * DOC: Survivability Mode
>   *
> - * Boot Survivability is a software based workflow for recovering a system in a failed boot state
> + * Survivability Mode is a software based workflow for recovering a system in a failed boot state
>   * Here system recoverability is concerned with recovering the firmware responsible for boot.
>   *
> - * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
> - * to be flashed through mei and collect telemetry. The driver's probe flow is modified
> - * such that it enters survivability mode when pcode initialization is incomplete and boot status
> - * denotes a failure.
> + * Boot Survivability
> + * ===================
> + *
> + * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
> + * the firmware to be flashed through mei driver and collect telemetry. The driver's probe flow is
> + * modified such that it enters survivability mode when pcode initialization is incomplete and boot
> + * status denotes a failure.
>   *
>   * Survivability mode can also be entered manually using the survivability mode attribute available
>   * through configfs which is beneficial in several usecases. It can be used to address scenarios
> @@ -46,7 +49,7 @@
>   * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
>   * debug information::
>   *
> - *     /sys/bus/pci/devices/<device>/surivability_mode
> + *     /sys/bus/pci/devices/<device>/survivability_mode
>   *
>   * Capability Information:
>   *     Provides boot status
> @@ -56,6 +59,22 @@
>   *     Provides history of previous failures
>   * Auxiliary Information
>   *     Certain failures may have information in addition to postcode information
> + *
> + * Runtime Survivability
> + * =====================
> + *
> + * Certain runtime firmware errors can cause the device to enter a wedged state
> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
> + * is indicated by the presence of survivability mode sysfs::
> + *
> + *     /sys/bus/pci/devices/<device>/survivability_mode
> + *
> + * Survivability mode sysfs provides information about the type of survivability mode.
> + *
> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
> + * survivability mode. User can then initiate a firmware flash using userspace tools like fwupd
> + * to restore device to normal operation.
>   */
>
>  static u32 aux_history_offset(u32 reg_value)
> @@ -327,7 +346,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>
>      xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>      xe_device_declare_wedged(xe);
> -    dev_err(&pdev->dev, "Firmware update required, Refer the userspace documentation for more details!\n");
> +    dev_err(&pdev->dev, "Firmware flash required, Refer the userspace documentation for more details!\n");
>
>      return 0;
>  }
> --
> 2.47.1
>
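
For readers following the documentation in this patch, here is a minimal userspace sketch of the two checks it describes: recognizing a ``WEDGED=vendor-specific`` uevent property and testing for the presence of the survivability_mode sysfs attribute. The helper names (wedged_has_method, survivability_path, survivability_mode_present) are hypothetical and not part of the patch or of any library. The comma-separated layout of the WEDGED value is an assumption taken from Documentation/gpu/drm-uapi.rst. How the uevent text is obtained (udev, a netlink socket) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: return true if one uevent environment entry is a
 * "WEDGED=" property whose value lists @method. The value is assumed to be
 * a comma-separated list of recovery methods, e.g. "WEDGED=rebind,bus-reset".
 */
static bool wedged_has_method(const char *env, const char *method)
{
	static const char key[] = "WEDGED=";
	size_t mlen;
	const char *p;

	assert(env && method);
	mlen = strlen(method);

	if (strncmp(env, key, sizeof(key) - 1) != 0)
		return false;

	p = env + sizeof(key) - 1;
	while (*p) {
		/* Length of the current list element, up to ',' or end. */
		size_t len = strcspn(p, ",");

		if (len == mlen && strncmp(p, method, len) == 0)
			return true;
		p += len;
		if (*p == ',')
			p++;
	}
	return false;
}

/*
 * Hypothetical helper: build the sysfs path named in the documentation for
 * PCI device @pci_dev (e.g. "0000:03:00.0"). Returns 0 on success, -1 if
 * the buffer is too small.
 */
static int survivability_path(char *buf, size_t size, const char *pci_dev)
{
	int n;

	assert(buf && pci_dev);
	n = snprintf(buf, size, "/sys/bus/pci/devices/%s/survivability_mode",
		     pci_dev);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: true if the survivability_mode attribute exists. */
static bool survivability_mode_present(const char *pci_dev)
{
	char path[256];

	if (survivability_path(path, sizeof(path), pci_dev) < 0)
		return false;
	return access(path, F_OK) == 0;
}
```

A listener would call wedged_has_method(entry, "vendor-specific") on each property of a drm uevent and, on a match, use survivability_mode_present() before handing the device to a firmware flashing tool such as fwupd. Reading the attribute itself is admin-only according to the documentation, so this sketch only checks that it exists.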