Re: [PATCH v9 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method

sashiko-bot Tue, 23 Jun 2026 14:04:56 -0700

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:


Pre-existing issues:
- [High] Directly invoking driver `remove()` and `probe()` callbacks from the 
PCI error handler bypasses the Linux device model, leading to driver core state 
corruption and subsequent crashes.
- [High] WEDGE_STR_LEN buffer size is too small to hold all recovery methods, 
leading to silently truncated uevent strings that userspace cannot parse.
--

commit e0ad0ce6dec9b17be989295e8a1391f07c6154a5
Author: Mallesh Koujalagi <[email protected]>

drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method

Introduce DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) recovery method to handle
scenarios requiring device power cycle. This method addresses cases where other
recovery mechanisms are insufficient to restore device functionality.

> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 675675480da49..19c25366635e9 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>               return "bus-reset";
>       case DRM_WEDGE_RECOVERY_VENDOR:
>               return "vendor-specific";
> +     case DRM_WEDGE_RECOVERY_COLD_RESET:
> +             return "cold-reset";

[Severity: High]
This is a pre-existing issue, but does adding the cold-reset method make the
WEDGE_STR_LEN buffer too small to hold all recovery methods?

Could this lead to silently truncated uevent strings that userspace cannot
parse?

In drm_dev_wedged_event(), the event_string buffer is hardcoded to 32 bytes:

drivers/gpu/drm/drm_drv.c:drm_dev_wedged_event() {
        char event_string[WEDGE_STR_LEN];
        ...
        len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");

        for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
                ...
                len += scnprintf(event_string + len, sizeof(event_string) - 
len, "%s,", recovery);
        }

        if (recovery)
                /* Get rid of trailing comma */
                event_string[len - 1] = '\0';
        ...
}

When formatting multiple methods (e.g. "WEDGED=rebind,bus-reset,cold-reset"),
the string length easily exceeds 32 bytes. If scnprintf() hits the buffer limit,
it returns 0 on the last iteration. Decrementing len then chops off the last
valid character of the truncated string, resulting in a corrupted uevent that
breaks userspace parsing.

>       default:
>               return NULL;
>       }

[ ... ]

> diff --git a/drivers/gpu/drm/xe/xe_pci_error.c 
> b/drivers/gpu/drm/xe/xe_pci_error.c
> new file mode 100644
> index 0000000000000..b08601f470d63
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pci_error.c
> @@ -0,0 +1,135 @@

[ ... ]

> +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev)
> +{
> +     const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, 
> pdev);
> +     struct xe_device *xe = pdev_to_xe_device(pdev);

[ ... ]

> +     xe_device_clear_in_reset(xe);
> +     pdev->driver->remove(pdev);
> +     devres_release_group(&pdev->dev, xe->devres_group_id);
> +
> +     if (pdev->driver->probe(pdev, ent))

[Severity: High]
This isn't a bug introduced by this patch, but does directly invoking the driver
remove() and probe() callbacks bypass the Linux device model?

The driver core still considers the device 'bound'. If the manual probe()
fails, it returns PCI_ERS_RESULT_DISCONNECT, causing the PCI core to formally
unbind the device later.

This invokes xe_pci_remove() on a partially initialized or already
manually-removed state, which could lead to resource double-frees and NULL
pointer dereferences.

> +             return PCI_ERS_RESULT_DISCONNECT;
> +
> +     xe = pdev_to_xe_device(pdev);
> +
> +     /* Wedge the device to prevent I/O operations till the resume callback 
> */
> +     xe_device_wedged_get(xe);
> +
> +     return PCI_ERS_RESULT_RECOVERED;
> +}

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=2

Re: [PATCH v9 2/6] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method

Reply via email to