On 3/18/26 11:30, Srinivasan Shanmugam wrote:
> When a GPU fault or timeout happens, the driver creates a devcoredump
> to collect debug information.
>
> During this, amdgpu_devcoredump_format() calls
> amdgpu_discovery_dump() to print IP discovery data.
>
> amdgpu_discovery_dump() uses:
> adev->discovery.ip_top
>
> and then accesses:
> ip_top->die_kset
>
> amdgpu_discovery_dump() uses adev->discovery.ip_top. However,
> ip_top may be NULL if the discovery topology was never initialized.
>
> The current code does not check for this before using ip_top. As a
> result, when ip_top is NULL, the coredump worker crashes while taking
> the spinlock for ip_top->die_kset.
>
> Fix this by checking for a missing ip_top before walking the discovery
> topology. If it is unavailable, print a short message in the dump and
> return safely.
>
> - If ip_top is NULL, print a message and skip the dump
> - Also add the same check in the cleanup path
>
> This makes the coredump and cleanup paths safe even when the
> discovery topology is not available.
>
> KASAN trace:
> [ 522.228252] [IGT] amd_deadlock: starting subtest amdgpu-deadlock-sdma
> [ 522.240681] [IGT] amd_deadlock: starting dynamic subtest
> amdgpu-deadlock-sdma
>
> ...
>
> [ 522.952317] Write of size 4 at addr 0000000000000050 by task
> kworker/u129:5/5434
> [ 522.937526] BUG: KASAN: null-ptr-deref in _raw_spin_lock+0x66/0xc0
> [ 522.967659] Workqueue: events_unbound amdgpu_devcoredump_deferred_work
> [amdgpu]
>
> ...
>
> [ 522.969445] Call Trace:
> [ 522.969508] _raw_spin_lock+0x66/0xc0
> [ 522.969518] ? __pfx__raw_spin_lock+0x10/0x10
> [ 522.969534] amdgpu_discovery_dump+0x61/0x530 [amdgpu]
> [ 522.971346] ? pick_next_task_fair+0x3f6/0x1c60
> [ 522.971363] amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu]
> [ 522.973188] ? __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu]
> [ 522.975012] ? psi_task_switch+0x2b5/0x9b0
> [ 522.975027] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm]
> [ 522.975198] ? __pfx___drm_puts_coredump+0x10/0x10 [drm]
> [ 522.975366] ? __schedule+0x113c/0x38d0
> [ 522.975381] amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu]
>
> v2: Updated commit message - Clarified that ip_top is not freed, it can
> just be NULL if discovery was not initialized. (Christian/Lijo)
>
> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already
> reports errors. (Christian)
>
> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in devcoredump")
> Cc: Pierre-Eric Pelloux-Prayer <[email protected]>
> Cc: Christian König <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Christian König <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index f7f37d93d0ce..6be1f971a31a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct
> amdgpu_device *adev)
> struct list_head *el, *tmp;
> struct kset *die_kset;
>
> + if (!ip_top)
> + return;
> +
> die_kset = &ip_top->die_kset;
> spin_lock(&die_kset->list_lock);
> list_for_each_prev_safe(el, tmp, &die_kset->list) {
> @@ -1419,9 +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev,
> struct drm_printer *p)
> struct ip_hw_instance *ip_inst;
> int i = 0, j;
>
> + drm_printf(p, "\nHW IP Discovery\n");
> +
> + if (!ip_top) {
> + drm_printf(p, "ip discovery topology unavailable\n");
> + return;
> + }
> +
> die_kset = &ip_top->die_kset;
>
> - drm_printf(p, "\nHW IP Discovery\n");
> spin_lock(&die_kset->list_lock);
> list_for_each(el_die, &die_kset->list) {
> drm_printf(p, "die %d\n", i++);