xe_drm_ras: Add error-event support for PVC

sashiko-bot Tue, 23 Jun 2026 13:58:26 -0700

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) 
to consider:
- [High] Potential console flood due to unratelimited `drm_err` in atomic 
context when netlink multicast fails.
- [Medium] Emitting RAS events with unchanged/zero counter values for uncounted 
errors.
- [Medium] Pointer arithmetic is performed on a potentially NULL pointer before 
checking if the DRM RAS feature is enabled.
--


commit 948141f8cea37215cdae8a5d0e089b0c6b181396
Author: Riana Tauro <[email protected]>

drm/xe/xe_drm_ras: Add error-event support for PVC

Report drm_ras error event to userspace when an error occurs.
Add support for core-compute and SoC errors in PVC.

> diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
> index 7937d8ba0ed9c..36afdfb5e4120 100644
> --- a/drivers/gpu/drm/xe/xe_drm_ras.c
> +++ b/drivers/gpu/drm/xe/xe_drm_ras.c

[ ... ]

> @@ -185,6 +185,36 @@ static int register_nodes(struct xe_device *xe)
>       return ret;
>  }
>  
> +/**
> + * xe_drm_ras_event() - Report drm-ras error event to userspace
> + * @xe: xe device structure
> + * @component: error component (see &enum drm_xe_ras_error_component)
> + * @severity: error severity (see &enum drm_xe_ras_error_severity)
> + * @value: value of error counter
> + * @flags: flags for allocation
> + *
> + * Report an error-event to userspace.
> + */
> +void xe_drm_ras_event(struct xe_device *xe, u32 component, u32 severity, u32 
> value, gfp_t flags)
> +{
> +     struct xe_drm_ras *ras = &xe->ras;
> +     struct xe_drm_ras_counter *info = ras->info[severity];
> +     struct drm_ras_node *node = &ras->node[severity];

[Severity: Medium]
Is it safe to compute the node pointer here before checking if the DRM
RAS feature is enabled?

If xe->info.has_drm_ras is false, xe_drm_ras_init() might not have been
called, leaving ras->node as NULL. Evaluating &ras->node[severity] on a
NULL pointer results in undefined behavior which could be flagged by UBSAN
or compiler optimizations.

> +     int ret;
> +
> +     /* Event is supported only if drm_ras is enabled */
> +     if (!xe->info.has_drm_ras)
> +             return;
> +
> +     if (!info || !info[component].name)
> +             return;
> +
> +     ret = drm_ras_nl_error_event(node, component, info[component].name, 
> value, flags);
> +     if (ret)
> +             drm_err(&xe->drm, "RAS error-event failed: %d for %s %s\n", ret,
> +                     info[component].name, error_severity[severity]);

[Severity: High]
Could this unratelimited drm_err() cause a console flood?

Since xe_drm_ras_event() is called from hw_error_source_handler() in
hardware error interrupt context, an interrupt storm could occur.
If drm_ras_nl_error_event() fails (e.g., due to GFP_ATOMIC allocation
failures under memory pressure), this will rapidly print errors to
the console.

Should this use drm_err_ratelimited() similar to how the caller
handles its logging?

> +}
> +
>  /**
>   * xe_drm_ras_init() - Initialize DRM RAS
>   * @xe: xe device instance

> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c 
> b/drivers/gpu/drm/xe/xe_hw_error.c
> index 4a4b363fc844b..a833cecc74ec5 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c

[ ... ]

> @@ -495,6 +495,9 @@ static void hw_error_source_handler(struct xe_tile *tile, 
> const enum hardware_er
>                       gt_hw_error_handler(tile, hw_err, error_id);
>               if (err_bit == XE_SOC_ERROR)
>                       soc_hw_error_handler(tile, hw_err, error_id);
> +
> +             value = atomic_read(&info[error_id].counter);
> +             xe_drm_ras_event(xe, error_id, severity, value, GFP_ATOMIC);

[Severity: Medium]
Will this emit an event with an unchanged or zero counter value for
correctable SOC errors?

In soc_hw_error_handler(), if the error is HARDWARE_ERROR_CORRECTABLE,
it clears the registers and jumps to unmask_gsysevtctl without
incrementing the error counter:

soc_hw_error_handler() {
    ...
    if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
        ...
        goto unmask_gsysevtctl;
    }
    ...
}

However, here in hw_error_source_handler(), xe_drm_ras_event() is
called unconditionally after soc_hw_error_handler() returns.
Does this result in userspace receiving an error event where the
counter value hasn't actually changed?

>       }
>  
>  clear_reg:

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=2

Re: [PATCH v3 2/3] drm/xe/xe_drm_ras: Add error-event support for PVC

Reply via email to