[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Wednesday, March 18, 2026 4:28 PM
> To: SHANMUGAM, SRINIVASAN <[email protected]>;
> Koenig, Christian <[email protected]>; Deucher, Alexander
> <[email protected]>
> Cc: [email protected]; Pelloux-Prayer, Pierre-Eric <Pierre-
> [email protected]>
> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery
> topology coredump path v3
>
>
>
> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote:
> > When a GPU fault or timeout happens, the driver creates a devcoredump
> > to collect debug information.
> >
> > During this, amdgpu_devcoredump_format() calls
> > amdgpu_discovery_dump() to print IP discovery data.
> >
> > amdgpu_discovery_dump() uses:
> >    adev->discovery.ip_top
> >
> > and then accesses:
> >    ip_top->die_kset
> >
> > amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top
> > may be NULL if the discovery topology was never initialized.
> >
> > The current code does not check for this before using ip_top. As a
> > result, when ip_top is NULL, the coredump worker crashes while taking
> > the spinlock for ip_top->die_kset.
> >
> > Fix this by checking for a missing ip_top before walking the discovery
> > topology. If it is unavailable, print a short message in the dump and
> > return safely.
> >
> > - If ip_top is NULL, print a message and skip the dump
> > - Also add the same check in the cleanup path
> >
> > This makes the coredump and cleanup paths safe even when the discovery
> > topology is not available.
> >
> > KASAN trace:
> > [  522.228252] [IGT] amd_deadlock: starting subtest
> > amdgpu-deadlock-sdma [  522.240681] [IGT] amd_deadlock: starting
> > dynamic subtest amdgpu-deadlock-sdma
> >
> > ...
> >
> > [  522.952317] Write of size 4 at addr 0000000000000050 by task
> > kworker/u129:5/5434 [  522.937526] BUG: KASAN: null-ptr-deref in
> > _raw_spin_lock+0x66/0xc0 [  522.967659] Workqueue: events_unbound
> > amdgpu_devcoredump_deferred_work [amdgpu]
> >
> > ...
> >
> > [  522.969445] Call Trace:
> > [  522.969508]  _raw_spin_lock+0x66/0xc0 [  522.969518]  ?
> > __pfx__raw_spin_lock+0x10/0x10 [  522.969534]
> > amdgpu_discovery_dump+0x61/0x530 [amdgpu] [  522.971346]  ?
> > pick_next_task_fair+0x3f6/0x1c60 [  522.971363]
> > amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [  522.973188]  ?
> > __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [  522.975012]  ?
> > psi_task_switch+0x2b5/0x9b0 [  522.975027]  ?
> > __pfx___drm_printfn_coredump+0x10/0x10 [drm] [  522.975198]  ?
> > __pfx___drm_puts_coredump+0x10/0x10 [drm] [  522.975366]  ?
> > __schedule+0x113c/0x38d0 [  522.975381]
> > amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu]
> >
> > v2: Updated commit message - Clarified that ip_top is not freed, it can
> >      just be NULL if discovery was not initialized. (Christian/Lijo)
> >
> > v3: Removed the extra drm_warn() for sysfs init failure as sysfs already
> >      reports errors. (Christian)
> >
> > Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in
> > devcoredump")
> > Cc: Pierre-Eric Pelloux-Prayer <[email protected]>
> > Cc: Christian König <[email protected]>
> > Cc: Alex Deucher <[email protected]>
> > Signed-off-by: Srinivasan Shanmugam <[email protected]>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index f7f37d93d0ce..6be1f971a31a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct
> amdgpu_device *adev)
> >     struct list_head *el, *tmp;
> >     struct kset *die_kset;
> >
> > +   if (!ip_top)
> > +           return;
> > +
> >     die_kset = &ip_top->die_kset;
> >     spin_lock(&die_kset->list_lock);
> >     list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9
> > +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev,
> struct drm_printer *p)
> >     struct ip_hw_instance *ip_inst;
> >     int i = 0, j;
> >
> > +   drm_printf(p, "\nHW IP Discovery\n");
> > +
> > +   if (!ip_top) {
> > +           drm_printf(p, "ip discovery topology unavailable\n");
>
> Is this type of printing really required or just skipping the whole section 
> good
> enough?


Silently skipping the rest may look like incomplete or missing data in the
coredump.

Adding a one-line message makes it clear that the topology was not
available, rather than leaving an empty section.

Best regards,
Srini

>
> Thanks,
> Lijo
>
> > +           return;
> > +   }
> > +
> >     die_kset = &ip_top->die_kset;
> >
> > -   drm_printf(p, "\nHW IP Discovery\n");
> >     spin_lock(&die_kset->list_lock);
> >     list_for_each(el_die, &die_kset->list) {
> >             drm_printf(p, "die %d\n", i++);

Reply via email to