On Tue, Jun 24, 2025 at 05:36:29PM -0400, Rodrigo Vivi wrote: > On Tue, Jun 24, 2025 at 04:23:31PM +0200, Christian König wrote: > > On 24.06.25 16:03, Riana Tauro wrote: > > > Hi Christian > > > > > > On 6/24/2025 5:56 PM, Christian König wrote: > > >> On 23.06.25 12:01, Riana Tauro wrote: > > >>> A device is declared wedged when it is non-recoverable from > > >>> the driver context. > > >> > > >> Well, not quite. > > > > > > i took this from the below document. Should it be changed? > > > > The wedge event basically meant that something unexpected happened during > > the lifetime of the the device (crash, hang whatever). > > > > It can be that the device recovered on it's own and nothing needs to be > > done (the none case in the documentation) and the event is just send for > > telemetry collection. > > > > But the usual case is to trigger a bus reset, rebing or even reboot. > > > > > https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging > > > > > >> > > >>> Some firmware errors can also cause > > >>> the device to enter this state and the only method to recover > > >>> from this would be to do a firmware flash > > >> > > >> What? What exactly do you mean with firmware flash here? > > >> > > >> Usually that means updating the firmware, but I don't see how this will > > >> bring you out of a wedge state? > > > > > > It means updating the firmware. > > > > > > Series: https://patchwork.freedesktop.org/series/149756/ > > > > > > In this xe kmd series, there are few firmware errors that cause the card > > > to be non-functional. The device is declared wedged and a firmware-flash > > > action is sent. > > > > Ok, so let me recap that just to make sure that I did understood that > > correctly. > > > > You find that the firmware flashed into the device is buggy and then raise > > a wedge event to automatically trigger a firmware update? > > > > Why not fail to load the driver in the first place? > > We already have that in place. If during the probe the fw machinery underneath > identified something is so bad that it needs to be flashed we boot in the > 'survivability mode'. The device is not discoverable for any gpu command > submission or memory management, but only fw flashing is possible on that > mode. > > This is on top of that. If the fw machinery had a bad unrecoverable error > and decided that fw updating is needed. > > > Or at least print a big warning into the system log? > > > > I mean a firmware update is usually something which the system > > administrator triggers very explicitly because when it fails for some > > reason (e.g. unexpected reset, power outage or whatever) it can sometimes > > brick the HW. > > > > I think it's rather brave to do this automatically. Are you sure we don't > > talk past each other on the meaning of the wedge event? > > The goal is not to do that automatically, but raise the uevent to the admin > with enough information that they can decide for the right correctable > action.
Christian, Andre, any concerns with this still? > > Thanks, > Rodrigo. > > > > > Thanks, > > Christian. > > > > > > > > There is corresponding fwupd PR in work that uses this uevent to trigger > > > a firmware flash > > > > > > fwupd PR: https://github.com/fwupd/fwupd/pull/8944/ > > > > > > Thanks > > > Riana > > > > > >> > > >> Where is the rest of the series? > > >> > > >> Regards, > > >> Christian. > > >> > > >>> v2: modify documentation (Raag, Rodrigo) > > >>> > > >>> Cc: André Almeida <andrealm...@igalia.com> > > >>> Cc: Christian König <christian.koe...@amd.com> > > >>> Signed-off-by: Riana Tauro <riana.ta...@intel.com> > > >>> --- > > >>> Documentation/gpu/drm-uapi.rst | 6 +++--- > > >>> drivers/gpu/drm/drm_drv.c | 2 ++ > > >>> include/drm/drm_device.h | 1 + > > >>> 3 files changed, 6 insertions(+), 3 deletions(-) > > >>> > > >>> diff --git a/Documentation/gpu/drm-uapi.rst > > >>> b/Documentation/gpu/drm-uapi.rst > > >>> index 263e5a97c080..cd2481458755 100644 > > >>> --- a/Documentation/gpu/drm-uapi.rst > > >>> +++ b/Documentation/gpu/drm-uapi.rst > > >>> @@ -422,9 +422,8 @@ Current implementation defines three recovery > > >>> methods, out of which, drivers > > >>> can use any one, multiple or none. Method(s) of choice will be sent > > >>> in the > > >>> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of > > >>> less to > > >>> more side-effects. If driver is unsure about recovery or method is > > >>> unknown > > >>> -(like soft/hard system reboot, firmware flashing, physical device > > >>> replacement > > >>> -or any other procedure which can't be attempted on the fly), > > >>> ``WEDGED=unknown`` > > >>> -will be sent instead. > > >>> +(like soft/hard system reboot, physical device replacement or any > > >>> other procedure > > >>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent > > >>> instead. > > >>> Userspace consumers can parse this event and attempt recovery as > > >>> per the > > >>> following expectations. > > >>> @@ -435,6 +434,7 @@ following expectations. > > >>> none optional telemetry collection > > >>> rebind unbind + bind driver > > >>> bus-reset unbind + bus reset/re-enumeration + bind > > >>> + firmware-flash firmware flash > > >>> unknown consumer policy > > >>> =============== ======================================== > > >>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > > >>> index 02556363e918..5f3bbe01c207 100644 > > >>> --- a/drivers/gpu/drm/drm_drv.c > > >>> +++ b/drivers/gpu/drm/drm_drv.c > > >>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned > > >>> int opt) > > >>> return "rebind"; > > >>> case DRM_WEDGE_RECOVERY_BUS_RESET: > > >>> return "bus-reset"; > > >>> + case DRM_WEDGE_RECOVERY_FW_FLASH: > > >>> + return "firmware-flash"; > > >>> default: > > >>> return NULL; > > >>> } > > >>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h > > >>> index 08b3b2467c4c..9d57c8882d93 100644 > > >>> --- a/include/drm/drm_device.h > > >>> +++ b/include/drm/drm_device.h > > >>> @@ -30,6 +30,7 @@ struct pci_controller; > > >>> #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional > > >>> telemetry collection */ > > >>> #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind > > >>> driver */ > > >>> #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset > > >>> bus device + bind */ > > >>> +#define DRM_WEDGE_RECOVERY_FW_FLASH BIT(3) /* firmware flash */ > > >>> /** > > >>> * struct drm_wedge_task_info - information about the guilty task of > > >>> a wedge dev > > >> > > > > > > > >