On Tue, Jul 15, 2025 at 04:17:26PM +0530, Riana Tauro wrote:
> Add documentation for vendor specific device wedged recovery method
> and runtime survivability.
...

> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index bd81ebd370cb..d28c92f8b80c 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1133,6 +1133,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  }
>
>  /**
> + * DOC: Device Wedging

Xe Device Wedging?

> + * Xe driver uses device wedged uevent as documented in Documentation/gpu/drm-uapi.rst.
> + *
> + * When device is in wedged state, every IOCTL will be blocked and GT cannot be
> + * used. Certain critical errors like gt reset failure, firmware failures can cause
> + * the device to be wedged. The default recovery mechanism for a wedged state

method

> + * is re-probe (unbind + bind)

Let's use uapi naming for consistency.

> + * Another recovery method is ``WEDGED=vendor-specific`. Below are the usecases

If we mean method, it's just ``vendor-specific`` with correct quoting.

> + * that trigger vendor-specific drm wedged uevent and actions to be performed
> + * to recover the device.
> + *
> + * Case 1: CSC firmware errors require a firmware flash to restore normal device
> + * operation. Since firmware flash is a vendor-specific action
> + * `WEDGED=vendor-specific`` recovery method along with
> + * :ref:`runtime survivability mode <xe-survivability-mode>` is used to
> + * notify userspace. User can then initiate a firmware flash using userspace tools
> + * like fwupd to restore device to normal situation.
> + */
> +
> +/*
>  * xe_device_set_wedged_method - Set wedged recovery method
>  * @xe: xe device instance
>  * @method: recovery method to set
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 267d0e3fd85a..9f770db116f4 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -22,15 +22,18 @@
>  #define MAX_SCRATCH_MMIO 8
>
>  /**
> - * DOC: Xe Boot Survivability
> + * DOC: Survivability Mode
>   *
> - * Boot Survivability is a software based workflow for recovering a system in a failed boot state
> + * Survivability Mode is a software based workflow for recovering a system in a failed boot state
>   * Here system recoverability is concerned with recovering the firmware responsible for boot.
>   *
> - * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
> - * to be flashed through mei and collect telemetry. The driver's probe flow is modified
> - * such that it enters survivability mode when pcode initialization is incomplete and boot status
> - * denotes a failure.
> + * Boot Survivability
> + * ===================
> + *
> + * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
> + * the firmware to be flashed through mei and collect telemetry. The driver's probe flow is

'mei driver' or it gives the impression of a tool. Also, what telemetry?

> + * modified such that it enters survivability mode when pcode initialization is incomplete and boot
> + * status denotes a failure.
>   *
>   * Survivability mode can also be entered manually using the survivability mode attribute available
>   * through configfs which is beneficial in several usecases. It can be used to address scenarios
> @@ -46,7 +49,7 @@
>   * Survivability mode is indicated by the below admin-only readable sysfs which provides additional

If it's sensitive, does it make sense to also log it?

>   * debug information::
>   *
> - * /sys/bus/pci/devices/<device>/surivability_mode
> + * /sys/bus/pci/devices/<device>/survivability_mode
>   *
>   * Capability Information:
>   * Provides boot status
> @@ -56,6 +59,22 @@
>   * Provides history of previous failures
>   * Auxiliary Information
>   * Certain failures may have information in addition to postcode information
> + *
> + * Runtime Survivability
> + * =====================
> + *
> + * Certain runtime firmware errors can cause the device to enter a wedged state
> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
> + * is indicated by the presence of survivability mode sysfs::
> + *
> + * /sys/bus/pci/devices/<device>/survivability_mode
> + *
> + * Survivability mode sysfs provides information about the type of survivability mode.
> + *
> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
> + * survivability mode. User can then initiate a firmware update using userspace tools like fwupd
> + * to restore device to normal operation.
>   */

Overall looks good and gets the point across, but I think consistent
terminology would make it easier to follow and understand.

method/mechanism/actions
wedged uevent/drm wedged uevent/drm device wedged uevent
firmware flash/firmware update
operation/situation

... and so on.

Raag
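
To illustrate why the method naming matters to consumers, below is a rough
userspace sketch (not part of the patch) of how a tool might interpret the
WEDGED property of the uevent. It assumes the property value is a
comma-separated list of method names, as described in
Documentation/gpu/drm-uapi.rst. The helper names and the returned action
strings are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `method` is listed in a "WEDGED=a,b,c" uevent property.
 * Hypothetical helper: only whole entries match, so "bind" does not
 * match "rebind".
 */
bool wedged_has_method(const char *prop, const char *method)
{
    static const char key[] = "WEDGED=";
    size_t mlen;
    const char *p;

    assert(prop && method);

    if (strncmp(prop, key, sizeof(key) - 1) != 0)
        return false;

    mlen = strlen(method);
    p = prop + sizeof(key) - 1;
    while (*p) {
        /* Length of the current entry, up to the next comma or the end. */
        size_t len = strcspn(p, ",");

        if (len == mlen && strncmp(p, method, len) == 0)
            return true;
        p += len;
        if (*p == ',')
            p++;
    }
    return false;
}

/*
 * Map the uevent property to an illustrative action label.
 * The labels are assumptions made for this sketch, not uapi names.
 */
const char *wedged_action(const char *prop)
{
    /* vendor-specific: consult driver docs, here a firmware flash. */
    if (wedged_has_method(prop, "vendor-specific"))
        return "firmware-flash";
    /* rebind: unbind + bind the driver. */
    if (wedged_has_method(prop, "rebind"))
        return "rebind";
    return "unknown";
}
```

With one name per method, a consumer like this only has to compare against
a single string per recovery path, which is the reason for asking that the
documentation stick to the uapi names.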