On 2/11/26 12:59, Mallesh Koujalagi wrote:
> This RFC patch series introduces a new DRM wedge recovery method
> 'DRM_WEDGE_RECOVERY_COLD_RESET' for handling critical errors
> that cannot be recovered through existing software-based mechanisms.
> 
> Background
> ----------
> Current recovery methods (driver rebind, bus reset, FLR) are effective
> for most error scenarios. However, certain critical errors
> affect device-level persistent state that survives warm resets and
> software recovery attempts. These errors require complete device power
> cycling to restore functionality.

I don't think that this is a sufficient justification for making those changes.

In particular, the patch set doesn't seem to add any detection for those
cases; it just exposes a debugfs file to trigger them.

So what is the actual technical background? In other words, when is a
cold reset actually necessary?

Regards,
Christian.

> 
> Proposed Solution
> -----------------
> This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
> recovery method to the DRM wedging framework. When this method is set,
> it signals to userspace that only a complete device cold reset (power
> cycle) can restore normal operation.
> 
> Example uevent received:
>   SUBSYSTEM=drm
>   WEDGED=cold-reset
>   DEVPATH=/devices/.../drm/card0
> 
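For illustration only, here is a minimal sketch of how a userspace consumer might interpret the uevent quoted above. It is not part of the series. The helper names (`parse_uevent`, `wedge_methods`, `needs_cold_reset`) are hypothetical, and the sketch assumes that `WEDGED` may carry a comma-separated list of methods, as the existing wedge uevents do, and that the new method is spelled `cold-reset` as in the example.

```python
def parse_uevent(payload: str) -> dict[str, str]:
    """Split the KEY=VALUE lines of a uevent payload into a dict."""
    fields = {}
    for line in payload.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines that carry no KEY=VALUE pair
            fields[key] = value
    return fields


def wedge_methods(fields: dict[str, str]) -> list[str]:
    """Return the recovery methods named in WEDGED, or [] if this is not a DRM wedge event."""
    if fields.get("SUBSYSTEM") != "drm" or "WEDGED" not in fields:
        return []
    # Assumption: several methods may be listed, separated by commas.
    return [method for method in fields["WEDGED"].split(",") if method]


def needs_cold_reset(fields: dict[str, str]) -> bool:
    """True if the event names the (proposed) cold-reset method."""
    return "cold-reset" in wedge_methods(fields)


# The example payload from the cover letter, with the elided path kept as-is.
EXAMPLE = """\
SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0
"""

if __name__ == "__main__":
    event = parse_uevent(EXAMPLE)
    print(event["DEVPATH"], needs_cold_reset(event))
```

A real consumer would receive the payload from udev rather than from a string constant, and what it should do once `needs_cold_reset` is true (for example, how the power cycle gets requested) is exactly the part the cover letter leaves open.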
> Testing
> -------
> The debugfs interface allows testing the cold reset recovery path:
> 
>   echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error
> 
> This triggers the critical error handler, wedges the device with
> cold reset method, and sends the appropriate uevent to userspace.
> 
> Cc: André Almeida <[email protected]>
> Cc: Christian König <[email protected]>
> Cc: David Airlie <[email protected]>
> Cc: Simona Vetter <[email protected]>
> Cc: Maxime Ripard <[email protected]>
> 
> Mallesh Koujalagi (4):
>   drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
>   drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
>   drm/xe: Add handler for critical errors which require cold-reset
>   drm/xe/debugfs: Add interface to trigger critical error handler
> 
>  Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/drm_drv.c        |  2 +
>  drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.h |  1 +
>  include/drm/drm_device.h         |  1 +
>  6 files changed, 142 insertions(+), 1 deletion(-)
> 
