Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider: - [Medium] Global struct fault_attr causes dentry reference leak and cross-device state contamination - [Medium] Potential NULL pointer dereference during early probe if wedged event is sent before DRM device registration --
commit f49d10330bc9238666f6e80cf2c7887368a6612b Author: Mallesh Koujalagi <[email protected]> drm/xe/ras: Use fault-inject to trigger punit error handler Use fault-inject framework to trigger punit_error_handler() for testing. > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c > index 22b471303984a..3a90deb5596ce 100644 > --- a/drivers/gpu/drm/xe/xe_debugfs.c > +++ b/drivers/gpu/drm/xe/xe_debugfs.c > @@ -40,6 +40,7 @@ > > DECLARE_FAULT_ATTR(gt_reset_failure); > DECLARE_FAULT_ATTR(inject_csc_hw_error); > +DECLARE_FAULT_ATTR(inject_punit_error); > > static void read_residency_counter(struct xe_device *xe, struct xe_mmio > *mmio, > u32 offset, const char *name, struct > drm_printer *p) [ ... ] > @@ -612,6 +613,8 @@ void xe_debugfs_register(struct xe_device *xe) > > fault_create_debugfs_attr("fail_gt_reset", root, >_reset_failure); > > + fault_create_debugfs_attr("inject_punit_error", root, > &inject_punit_error); [Severity: Medium] Since inject_punit_error is declared as a global attribute, will calling fault_create_debugfs_attr() for each device overwrite attr->dname? Looking at fault_create_debugfs_attr(), it unconditionally assigns attr->dname = dget(dir). If the system has multiple XE GPUs, each subsequent device's debugfs registration would overwrite attr->dname without calling dput() on the previous reference, leaking the dentry. Would this also expose a single global fault injection state across all devices, meaning a write to one device's debugfs file unintentionally alters the behavior for all other XE devices? > + > if (IS_SRIOV_PF(xe)) > xe_sriov_pf_debugfs_register(xe, root); > else if (IS_SRIOV_VF(xe)) > diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c > index 5a2fee5a13083..bb8502472f738 100644 > --- a/drivers/gpu/drm/xe/xe_ras.c > +++ b/drivers/gpu/drm/xe/xe_ras.c [ ... ] > @@ -708,6 +717,11 @@ enum xe_ras_recovery_action xe_ras_process_errors(struct > xe_device *xe) > size_t rlen; > int ret; > > + if (fault_inject_punit_error()) { > + punit_error_handler(xe); > + return XE_RAS_RECOVERY_ACTION_DISCONNECT; > + } [Severity: Medium] Could this trigger a NULL pointer dereference during early probe? xe_ras_init() is called during driver probe and executes xe_ras_process_errors(). If the injected fault triggers here, it executes punit_error_handler() which in turn calls xe_device_declare_wedged(). If xe_device_declare_wedged() invokes drm_dev_wedged_event() to notify userspace before the DRM device interface is fully registered (before drm_dev_register() initializes the minor's kdev), it might evaluate &dev->primary->kdev->kobj and dereference NULL. Because the inject_punit_error fault attribute is global, if a user sets the probability to 100% on the first probed device, could it inadvertently trigger during the probe of a subsequently attached device and crash the system? > + > if (!xe->info.has_sysctrl) > return XE_RAS_RECOVERY_ACTION_RESET; > -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=6
