ras: Cancel and free message on get counter failure

sashiko-bot Wed, 24 Jun 2026 03:15:40 -0700

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:


Pre-existing issues:
- [Critical] Lockless access to `drm_ras_node` pointers from the global 
`drm_ras_xa` XArray leads to Use-After-Free and control-flow hijacking during 
concurrent device teardown.
- [Medium] Missing generation counter and NLM_F_DUMP_INTR flag for lockless 
Netlink dumps.
--

commit 17d70fc870ba118f3745064732f976f0856f2d46
Author: Raag Jadav <[email protected]>

drm/ras: Cancel and free message on get counter failure

This commit fixes a resource leak in doit_reply_value() where sk_buff
and the genetlink header were not cleaned up on a get counter failure.
It also consolidates the error handling using a goto pattern.

> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
> index d6eab29a1394c..467a169026fc1 100644
> --- a/drivers/gpu/drm/drm_ras.c
> +++ b/drivers/gpu/drm/drm_ras.c
> @@ -201,25 +201,28 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
[ ... ]
>       ret = get_node_error_counter(node_id, error_id,
>                                    &error_name, &value);

[Severity: Critical]
This is a pre-existing issue, but does lockless access to drm_ras_node
pointers from the global drm_ras_xa XArray lead to a use-after-free
during concurrent device teardown?

The drm_ras netlink operations use .parallel_ops = true, meaning handlers
like get_node_error_counter() run concurrently without holding the global
genl_lock().

When retrieving a drm_ras_node pointer via xa_load(), the XArray API drops
the RCU read lock immediately after the lookup, returning a raw, unprotected
pointer:

drivers/gpu/drm/drm_ras.c:get_node_error_counter() {
        node = xa_load(&drm_ras_xa, node_id);
        if (!node || !node->query_error_counter)
                return -ENOENT;
        ...
        return node->query_error_counter(node, error_id, name, value);
}

Concurrently, a driver can unregister its nodes during teardown, which removes
the node from the XArray and synchronously frees the node memory via DRM's
managed cleanup.

Could the netlink handler then dereference this freed pointer, calling function
pointers read from freed memory?

This same pattern appears in drm_ras_nl_list_nodes_dumpit() and
get_node_error_threshold().
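
One common way to close this kind of window is to take a reference on the
object while the lookup is still protected, and to free it only when the last
reference is dropped. The following is a minimal userspace sketch of that idea,
not kernel code and not a proposed patch. A mutex-protected array stands in for
drm_ras_xa, and struct ras_node, node_get(), node_put() and the other names are
hypothetical. In the kernel the equivalent would more likely combine an RCU
read-side section with a refcount and an RCU-deferred free.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 8

/* Hypothetical stand-in for struct drm_ras_node. */
struct ras_node {
	int refcount;		/* protected by table_lock */
	int (*query)(struct ras_node *node, unsigned int *value);
	unsigned int counter;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ras_node *table[MAX_NODES];	/* stand-in for drm_ras_xa */

static int query_counter(struct ras_node *node, unsigned int *value)
{
	*value = node->counter;
	return 0;
}

/* Publish a node; the table itself owns one reference. */
static int node_register(unsigned int id, unsigned int counter)
{
	struct ras_node *node;

	if (id >= MAX_NODES)
		return -EINVAL;
	node = calloc(1, sizeof(*node));
	if (!node)
		return -ENOMEM;
	node->refcount = 1;
	node->query = query_counter;
	node->counter = counter;

	pthread_mutex_lock(&table_lock);
	if (table[id]) {
		pthread_mutex_unlock(&table_lock);
		free(node);
		return -EEXIST;
	}
	table[id] = node;
	pthread_mutex_unlock(&table_lock);
	return 0;
}

/* Look up a node and pin it before the lock is dropped. */
static struct ras_node *node_get(unsigned int id)
{
	struct ras_node *node;

	pthread_mutex_lock(&table_lock);
	node = id < MAX_NODES ? table[id] : NULL;
	if (node)
		node->refcount++;
	pthread_mutex_unlock(&table_lock);
	return node;
}

/* Drop a reference; the last holder frees the node. */
static void node_put(struct ras_node *node)
{
	bool last;

	pthread_mutex_lock(&table_lock);
	assert(node->refcount > 0);
	last = (--node->refcount == 0);
	pthread_mutex_unlock(&table_lock);
	if (last)
		free(node);
}

/* Unpublish the node and drop the table's reference. */
static void node_unregister(unsigned int id)
{
	struct ras_node *node = NULL;

	pthread_mutex_lock(&table_lock);
	if (id < MAX_NODES) {
		node = table[id];
		table[id] = NULL;
	}
	pthread_mutex_unlock(&table_lock);
	if (node)
		node_put(node);
}

/* Handler-side path: the node stays valid for the whole callback. */
static int get_counter(unsigned int id, unsigned int *value)
{
	struct ras_node *node = node_get(id);
	int ret;

	if (!node)
		return -ENOENT;
	ret = node->query(node, value);
	node_put(node);
	return ret;
}
```

In this model a handler that already holds a reference can still call
node->query() safely after node_unregister() has run, while new lookups fail
with -ENOENT. Whether a refcount, a lock held across the callback, or some
other scheme fits drm_ras best is a design question for the maintainers.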


[Severity: Medium]
This is also a pre-existing issue, but are we missing a generation counter and
NLM_F_DUMP_INTR flag for lockless Netlink dumps?

The Generic Netlink UAPI rules dictate: "If iteration during a dump may skip or
repeat objects (e.g. due to lockless data structures), set NLM_F_DUMP_INTR on
the affected message(s)."

drm_ras_nl_list_nodes_dumpit() and drm_ras_nl_get_error_counter_dumpit()
iterate over the lockless drm_ras_xa XArray across potentially multiple netlink
continuations:

drivers/gpu/drm/drm_ras.c:drm_ras_nl_list_nodes_dumpit() {
        xa_for_each_start(&drm_ras_xa, id, node, ctx->restart) {
                ...
        }
}

Because drm_ras_xa can be modified concurrently by node registration or
unregistration, a multipart dump might skip nodes that are registered at an
index below ctx->restart after an earlier pass. Should the implementation
maintain a generation counter or set NLM_F_DUMP_INTR on the netlink message to
inform userspace of the sequence change?
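
The generation-counter idea can be modelled in a few lines. This is a
single-threaded userspace sketch under stated assumptions, not kernel code:
struct registry, struct dump_ctx and dump_pass() are hypothetical names, and
the interrupted flag stands in for NLM_F_DUMP_INTR. In the kernel this role is
usually played by cb->seq together with genl_dump_check_consistent(), with the
counter read under whatever protects the container.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical stand-in for drm_ras_xa plus a generation counter. */
struct registry {
	bool present[MAX_NODES];
	unsigned int gen;	/* bumped on every register/unregister */
};

/* Per-dump state carried across continuations, like ctx->restart. */
struct dump_ctx {
	size_t restart;		/* next index to visit */
	unsigned int seq;	/* generation seen on the first pass */
	bool started;
};

/* Register or unregister a node and record that the set changed. */
static void registry_set(struct registry *r, size_t id, bool present)
{
	assert(id < MAX_NODES);
	r->present[id] = present;
	r->gen++;
}

/*
 * Emit up to @budget node ids into @out, resuming at ctx->restart.
 * *interrupted is set when the registry changed since the dump began,
 * which is when the real code would set NLM_F_DUMP_INTR.
 */
static size_t dump_pass(const struct registry *r, struct dump_ctx *ctx,
			size_t *out, size_t budget, bool *interrupted)
{
	size_t n = 0;

	if (!ctx->started) {
		ctx->seq = r->gen;
		ctx->started = true;
	}
	*interrupted = (ctx->seq != r->gen);

	while (ctx->restart < MAX_NODES && n < budget) {
		if (r->present[ctx->restart])
			out[n++] = ctx->restart;
		ctx->restart++;
	}
	return n;
}
```

If a node is registered at index 0 after the first pass has already moved
ctx->restart past it, the second pass still skips it, but the interrupted flag
tells userspace that the result is not a consistent snapshot and that the dump
should be retried.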

>       if (ret)
> -             return ret;
> +             goto cancel_msg;
>  
>       ret = msg_reply_value(msg, error_id, error_name, value);

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=1

Re: [PATCH v4 1/5] drm/ras: Cancel and free message on get counter failure
