Hi Christian,
On 11-02-2026 05:57 pm, Christian König wrote:
On 2/11/26 12:59, Mallesh Koujalagi wrote:
This RFC patch series introduces a new DRM wedge recovery method
'DRM_WEDGE_RECOVERY_COLD_RESET' for handling critical errors
that cannot be recovered through existing software-based mechanisms.
Background
----------
Current recovery methods (driver rebind, bus reset, FLR) are effective
for most error scenarios. However, certain critical errors
affect device-level persistent state that survives warm resets and
software recovery attempts. These errors require complete device power
cycling to restore functionality.
I don't think that this is a sufficient justification for making those changes.
Especially since the patch set doesn't seem to add any detection for those
cases, but rather just exposes a debugfs file to trigger them.
So what is the actual technical background? In other words when is that
necessary?
Regards,
Christian.
Thanks for the feedback. Sorry, I forgot to include a reference to the actual use case.
This method handles errors from the power management unit that require
a complete power cycle (cold reset) to recover from.
I'll add the actual implementation in the next revision. It will be an extension of
our WIP RAS infrastructure, which is being developed in parallel.
(see https://patchwork.freedesktop.org/series/160482/)
The current RFC series is meant to gather community feedback on the proposed
recovery method as part of the wedging uAPI.
Thanks,
-/Mallesh
Proposed Solution
-----------------
This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
recovery method to the DRM wedging framework. When this method is set,
it signals to userspace that only a complete device cold reset (power
cycle) can restore normal operation.
Example uevent received:
SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0
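To illustrate how a userspace consumer might interpret the WEDGED= value in the uevent above, here is a minimal sketch. The WEDGE_* macros and function names are hypothetical userspace-side names, not kernel identifiers. Only "cold-reset" and its BIT(4) position come from this series; the other strings ("none", "rebind", "bus-reset", "vendor-specific") and the comma-separated form reflect my understanding of the existing wedging uAPI documentation and should be checked against Documentation/gpu/drm-uapi.rst.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the recovery method bits. */
#define WEDGE_NONE        (1u << 0)
#define WEDGE_REBIND      (1u << 1)
#define WEDGE_BUS_RESET   (1u << 2)
#define WEDGE_VENDOR      (1u << 3)
#define WEDGE_COLD_RESET  (1u << 4)   /* proposed in this series as BIT(4) */
#define WEDGE_UNKNOWN     (1u << 31)  /* local marker for unrecognised names */

struct wedge_name {
	const char *name;
	unsigned int bit;
};

/* Strings other than "cold-reset" are assumed from the existing uAPI doc. */
static const struct wedge_name wedge_names[] = {
	{ "none",            WEDGE_NONE },
	{ "rebind",          WEDGE_REBIND },
	{ "bus-reset",       WEDGE_BUS_RESET },
	{ "vendor-specific", WEDGE_VENDOR },
	{ "cold-reset",      WEDGE_COLD_RESET },
};

/* Turn a comma-separated method list such as "rebind,bus-reset" into a mask. */
static unsigned int wedge_parse_methods(const char *value)
{
	unsigned int mask = 0;

	while (*value) {
		size_t len = strcspn(value, ",");  /* length of current token */
		size_t i;
		int found = 0;

		for (i = 0; i < sizeof(wedge_names) / sizeof(wedge_names[0]); i++) {
			if (strlen(wedge_names[i].name) == len &&
			    strncmp(value, wedge_names[i].name, len) == 0) {
				mask |= wedge_names[i].bit;
				found = 1;
				break;
			}
		}
		if (!found)
			mask |= WEDGE_UNKNOWN;

		value += len;
		if (*value == ',')
			value++;  /* skip the separator */
	}
	return mask;
}

/* Parse one "KEY=value" uevent line; returns 0 if it is not a WEDGED line. */
static unsigned int wedge_parse_line(const char *line)
{
	static const char key[] = "WEDGED=";

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return 0;
	return wedge_parse_methods(line + sizeof(key) - 1);
}
```

A consumer that sees WEDGE_COLD_RESET in the returned mask would know that rebind or bus reset will not help, and could escalate to whatever platform mechanism power cycles the device.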
Testing
-------
The debugfs interface allows testing the cold reset recovery path:
echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error
This triggers the critical error handler, wedges the device with
cold reset method, and sends the appropriate uevent to userspace.
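For automated tests, the echo command above could be driven from a small C helper along these lines. This is only a sketch: the helper names are hypothetical, the DRM minor number N has to be supplied by the caller, and the file itself exists only with this series applied and debugfs mounted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build the debugfs path for DRM minor 'minor'.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int trigger_path(char *buf, size_t len, unsigned int minor)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/debug/dri/%u/trigger_critical_error",
			 minor);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1\n" to the given file, matching what 'echo 1 >' does.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int trigger_write(const char *path)
{
	ssize_t ret;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	ret = write(fd, "1\n", 2);
	close(fd);

	return ret == 2 ? 0 : -1;
}
```

A test would call trigger_path() for the card under test, then trigger_write(), and then wait for the WEDGED=cold-reset uevent.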
Cc: André Almeida <[email protected]>
Cc: Christian König <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Maxime Ripard <[email protected]>
Mallesh Koujalagi (4):
drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
drm/xe: Add handler for critical errors which require cold-reset
drm/xe/debugfs: Add interface to trigger critical error handler
Documentation/gpu/drm-uapi.rst | 73 +++++++++++++++++++++++++++++++-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/xe_debugfs.c | 38 +++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 1 +
include/drm/drm_device.h | 1 +
6 files changed, 142 insertions(+), 1 deletion(-)