Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues: - [High] Directly invoking driver `remove()` and `probe()` callbacks from the PCI error handler bypasses the Linux device model, leading to driver core state corruption and subsequent crashes. - [High] WEDGE_STR_LEN buffer size is too small to hold all recovery methods, leading to silently truncated uevent strings that userspace cannot parse. -- commit e0ad0ce6dec9b17be989295e8a1391f07c6154a5 Author: Mallesh Koujalagi <[email protected]> drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method Introduce DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) recovery method to handle scenarios requiring device power cycle. This method addresses cases where other recovery mechanisms are insufficient to restore device functionality. > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > index 675675480da49..19c25366635e9 100644 > --- a/drivers/gpu/drm/drm_drv.c > +++ b/drivers/gpu/drm/drm_drv.c > @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int > opt) > return "bus-reset"; > case DRM_WEDGE_RECOVERY_VENDOR: > return "vendor-specific"; > + case DRM_WEDGE_RECOVERY_COLD_RESET: > + return "cold-reset"; [Severity: High] This is a pre-existing issue, but does adding the cold-reset method make the WEDGE_STR_LEN buffer too small to hold all recovery methods? Could this lead to silently truncated uevent strings that userspace cannot parse? In drm_dev_wedged_event(), the event_string buffer is hardcoded to 32 bytes: drivers/gpu/drm/drm_drv.c:drm_dev_wedged_event() { char event_string[WEDGE_STR_LEN]; ... len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED="); for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) { ... len += scnprintf(event_string + len, sizeof(event_string) - len, "%s,", recovery); } if (recovery) /* Get rid of trailing comma */ event_string[len - 1] = '\0'; ... } When formatting multiple methods (e.g. "WEDGED=rebind,bus-reset,cold-reset"), the string length easily exceeds 32 bytes. If scnprintf() hits the buffer limit, it returns 0 on the last iteration. Decrementing len then chops off the last valid character of the truncated string, resulting in a corrupted uevent that breaks userspace parsing. > default: > return NULL; > } [ ... ] > diff --git a/drivers/gpu/drm/xe/xe_pci_error.c > b/drivers/gpu/drm/xe/xe_pci_error.c > new file mode 100644 > index 0000000000000..b08601f470d63 > --- /dev/null > +++ b/drivers/gpu/drm/xe/xe_pci_error.c > @@ -0,0 +1,135 @@ [ ... ] > +static pci_ers_result_t xe_pci_error_slot_reset(struct pci_dev *pdev) > +{ > + const struct pci_device_id *ent = pci_match_id(pdev->driver->id_table, > pdev); > + struct xe_device *xe = pdev_to_xe_device(pdev); [ ... ] > + xe_device_clear_in_reset(xe); > + pdev->driver->remove(pdev); > + devres_release_group(&pdev->dev, xe->devres_group_id); > + > + if (pdev->driver->probe(pdev, ent)) [Severity: High] This isn't a bug introduced by this patch, but does directly invoking the driver remove() and probe() callbacks bypass the Linux device model? The driver core still considers the device 'bound'. If the manual probe() fails, it returns PCI_ERS_RESULT_DISCONNECT, causing the PCI core to formally unbind the device later. This invokes xe_pci_remove() on a partially initialized or already manually-removed state, which could lead to resource double-frees and NULL pointer dereferences. > + return PCI_ERS_RESULT_DISCONNECT; > + > + xe = pdev_to_xe_device(pdev); > + > + /* Wedge the device to prevent I/O operations till the resume callback > */ > + xe_device_wedged_get(xe); > + > + return PCI_ERS_RESULT_RECOVERED; > +} -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=2
