On 2/11/26 12:59, Mallesh Koujalagi wrote:
> This RFC patch series introduces a new DRM wedge recovery method
> 'DRM_WEDGE_RECOVERY_COLD_RESET' for handling critical errors
> that cannot be recovered through existing software-based mechanisms.
>
> Background
> ----------
> Current recovery methods (driver rebind, bus reset, FLR) are effective
> for most error scenarios. However, certain critical errors
> affect device-level persistent state that survives warm resets and
> software recovery attempts. These errors require complete device power
> cycling to restore functionality.

I don't think that this is a sufficient justification for making those
changes.

Especially since the patch set doesn't seem to add any detection for
those cases, but rather just exposes a debugfs file to trigger them.

So what is the actual technical background? In other words when is that
necessary?

Regards,
Christian.

>
> Proposed Solution
> -----------------
> This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
> recovery method to the DRM wedging framework. When this method is set,
> it signals to userspace that only a complete device cold reset (power
> cycle) can restore normal operation.
>
> Example uevent received:
>   SUBSYSTEM=drm
>   WEDGED=cold-reset
>   DEVPATH=/devices/.../drm/card0
>
> Testing
> -------
> The debugfs interface allows testing the cold reset recovery path:
>
>   echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error
>
> This triggers the critical error handler, wedges the device with
> cold reset method, and sends the appropriate uevent to userspace.
>
> Cc: André Almeida <[email protected]>
> Cc: Christian König <[email protected]>
> Cc: David Airlie <[email protected]>
> Cc: Simona Vetter <[email protected]>
> Cc: Maxime Ripard <[email protected]>
>
> Mallesh Koujalagi (4):
>   drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
>   drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
>   drm/xe: Add handler for critical errors which require cold-reset
>   drm/xe/debugfs: Add interface to trigger critical error handler
>
>  Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/drm_drv.c        |  2 +
>  drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.h |  1 +
>  include/drm/drm_device.h         |  1 +
>  6 files changed, 142 insertions(+), 1 deletion(-)
>
