xe: Handle PUNIT errors by requesting cold-reset recovery

sashiko-bot Fri, 05 Jun 2026 05:48:31 -0700

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:


Pre-existing issues:
- [High] Emitting a DRM wedged uevent before the device is registered in sysfs 
causes the uevent to be silently dropped, losing the cold-reset notification.
- [High] Concurrent modification of unprotected `xe->wedged.method` can cause 
the `COLD_RESET` recovery method to be overwritten.
--

commit d8f11e3cba2116967c1b1af489f30a2fd996f664
Author: Mallesh Koujalagi <[email protected]>

drm/xe: Handle PUNIT errors by requesting cold-reset recovery

When PUNIT errors are detected that persist across warm resets, this
commit marks the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET
and notifies userspace that a complete device power cycle is required.

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c

[ ... ]

> @@ -207,6 +209,12 @@ static enum xe_ras_recovery_action 
> handle_core_compute_errors(struct xe_ras_erro
>       return XE_RAS_RECOVERY_ACTION_RECOVERED;
>  }
>  
> +static void punit_error_handler(struct xe_device *xe)
> +{
> +     xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);

[Severity: High]
This is a pre-existing issue, but does concurrent modification of the
unprotected xe->wedged.method here cause the COLD_RESET recovery method to
be overwritten?

If a concurrent event (such as a GT hang timeout) triggers
xe_device_declare_wedged() concurrently, there appears to be a race
condition. The concurrent thread may evaluate if (!xe->wedged.method) as
true just before punit_error_handler() executes xe_device_set_wedged_method()
with DRM_WEDGE_RECOVERY_COLD_RESET.

The concurrent thread would then execute:

xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND |
                                DRM_WEDGE_RECOVERY_BUS_RESET);

Will this overwrite the COLD_RESET method and result in an incorrect
recovery uevent being broadcast to userspace?

> +     xe_device_declare_wedged(xe);

[Severity: High]
This isn't a bug introduced by this patch, but does emitting a DRM wedged
uevent before the device is registered in sysfs cause the uevent to be
silently dropped?

During driver initialization, xe_device_probe() calls xe_ras_init(), which
processes preexisting hardware errors. If a PUNIT error is detected at this
stage, punit_error_handler() marks the device as wedged and calls
xe_device_declare_wedged(), which in turn calls drm_dev_wedged_event().

Since drm_dev_register() has not yet been called at this stage of the probe,
device_add() has not executed, leaving state_in_sysfs = 0 for the kobject.
The core kobject_uevent_env() explicitly drops events for unlinked kobjects.

Does this cause the WEDGED=cold-reset uevent notification intended for
userspace to be permanently lost, leaving the device unusable without
userspace automation knowing it needs to power-cycle the slot?

> +}
> +
>  static enum xe_ras_recovery_action handle_soc_internal_errors(struct 
> xe_device *xe,

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=4

Re: [PATCH v7 4/6] drm/xe: Handle PUNIT errors by requesting cold-reset recovery

Reply via email to