Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:
- [Medium] Global struct fault_attr causes dentry reference leak and 
cross-device state contamination
- [Medium] Potential NULL pointer dereference during early probe if wedged 
event is sent before DRM device registration
--

commit f49d10330bc9238666f6e80cf2c7887368a6612b
Author: Mallesh Koujalagi <[email protected]>

drm/xe/ras: Use fault-inject to trigger punit error handler

Use fault-inject framework to trigger punit_error_handler()
for testing.

> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
> index 22b471303984a..3a90deb5596ce 100644
> --- a/drivers/gpu/drm/xe/xe_debugfs.c
> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> @@ -40,6 +40,7 @@
>  
>  DECLARE_FAULT_ATTR(gt_reset_failure);
>  DECLARE_FAULT_ATTR(inject_csc_hw_error);
> +DECLARE_FAULT_ATTR(inject_punit_error);
>  
>  static void read_residency_counter(struct xe_device *xe, struct xe_mmio 
> *mmio,
>                                  u32 offset, const char *name, struct 
> drm_printer *p)

[ ... ]

> @@ -612,6 +613,8 @@ void xe_debugfs_register(struct xe_device *xe)
>  
>       fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
>  
> +     fault_create_debugfs_attr("inject_punit_error", root, 
> &inject_punit_error);

[Severity: Medium]
Since inject_punit_error is declared as a global attribute, will calling
fault_create_debugfs_attr() for each device overwrite attr->dname?

Looking at fault_create_debugfs_attr(), it unconditionally assigns
attr->dname = dget(dir). If the system has multiple XE GPUs, each
subsequent device's debugfs registration would overwrite attr->dname without
calling dput() on the previous reference, leaking the dentry.

Would this also expose a single global fault injection state across all
devices, meaning a write to one device's debugfs file unintentionally alters
the behavior for all other XE devices?

> +
>       if (IS_SRIOV_PF(xe))
>               xe_sriov_pf_debugfs_register(xe, root);
>       else if (IS_SRIOV_VF(xe))
> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 5a2fee5a13083..bb8502472f738 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c

[ ... ]

> @@ -708,6 +717,11 @@ enum xe_ras_recovery_action xe_ras_process_errors(struct 
> xe_device *xe)
>       size_t rlen;
>       int ret;
>  
> +     if (fault_inject_punit_error()) {
> +             punit_error_handler(xe);
> +             return XE_RAS_RECOVERY_ACTION_DISCONNECT;
> +     }

[Severity: Medium]
Could this trigger a NULL pointer dereference during early probe?

xe_ras_init() is called during driver probe and executes
xe_ras_process_errors(). If the injected fault triggers here, it executes
punit_error_handler() which in turn calls xe_device_declare_wedged().

If xe_device_declare_wedged() invokes drm_dev_wedged_event() to notify
userspace before the DRM device interface is fully registered (before
drm_dev_register() initializes the minor's kdev), it might evaluate
&dev->primary->kdev->kobj and dereference NULL.

Because the inject_punit_error fault attribute is global, if a user sets
the probability to 100% on the first probed device, could it inadvertently
trigger during the probe of a subsequently attached device and crash the
system?

> +
>       if (!xe->info.has_sysctrl)
>               return XE_RAS_RECOVERY_ACTION_RESET;
>

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=6

Reply via email to