[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Thursday, March 19, 2026 9:46 AM
> To: SHANMUGAM, SRINIVASAN <[email protected]>;
> Koenig, Christian <[email protected]>; Deucher, Alexander
> <[email protected]>
> Cc: [email protected]; Pelloux-Prayer, Pierre-Eric <Pierre-
> [email protected]>
> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery
> topology coredump path v3
>
>
>
> On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <[email protected]>
> >> Sent: Wednesday, March 18, 2026 4:28 PM
> >> To: SHANMUGAM, SRINIVASAN <[email protected]>;
> Koenig,
> >> Christian <[email protected]>; Deucher, Alexander
> >> <[email protected]>
> >> Cc: [email protected]; Pelloux-Prayer, Pierre-Eric
> >> <Pierre- [email protected]>
> >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in
> >> discovery topology coredump path v3
> >>
> >>
> >>
> >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote:
> >>> When a GPU fault or timeout happens, the driver creates a
> >>> devcoredump to collect debug information.
> >>>
> >>> During this, amdgpu_devcoredump_format() calls
> >>> amdgpu_discovery_dump() to print IP discovery data.
> >>>
> >>> amdgpu_discovery_dump() uses:
> >>>     adev->discovery.ip_top
> >>>
> >>> and then accesses:
> >>>     ip_top->die_kset
> >>>
> >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top
> >>> may be NULL if the discovery topology was never initialized.
> >>>
> >>> The current code does not check for this before using ip_top. As a
> >>> result, when ip_top is NULL, the coredump worker crashes while
> >>> taking the spinlock for ip_top->die_kset.
> >>>
> >>> Fix this by checking for a missing ip_top before walking the
> >>> discovery topology. If it is unavailable, print a short message in
> >>> the dump and return safely.
> >>>
> >>> - If ip_top is NULL, print a message and skip the dump
> >>> - Also add the same check in the cleanup path
> >>>
> >>> This makes the coredump and cleanup paths safe even when the
> >>> discovery topology is not available.
> >>>
> >>> KASAN trace:
> >>> [  522.228252] [IGT] amd_deadlock: starting subtest
> >>> amdgpu-deadlock-sdma [  522.240681] [IGT] amd_deadlock: starting
> >>> dynamic subtest amdgpu-deadlock-sdma
> >>>
> >>> ...
> >>>
> >>> [  522.952317] Write of size 4 at addr 0000000000000050 by task
> >>> kworker/u129:5/5434 [  522.937526] BUG: KASAN: null-ptr-deref in
> >>> _raw_spin_lock+0x66/0xc0 [  522.967659] Workqueue: events_unbound
> >>> amdgpu_devcoredump_deferred_work [amdgpu]
> >>>
> >>> ...
> >>>
> >>> [  522.969445] Call Trace:
> >>> [  522.969508]  _raw_spin_lock+0x66/0xc0 [  522.969518]  ?
> >>> __pfx__raw_spin_lock+0x10/0x10 [  522.969534]
> >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [  522.971346]  ?
> >>> pick_next_task_fair+0x3f6/0x1c60 [  522.971363]
> >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [  522.973188]  ?
> >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [  522.975012]  ?
> >>> psi_task_switch+0x2b5/0x9b0 [  522.975027]  ?
> >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [  522.975198]  ?
> >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [  522.975366]  ?
> >>> __schedule+0x113c/0x38d0 [  522.975381]
> >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu]
> >>>
> >>> v2: Updated commit message - Clarified that ip_top is not freed, it can
> >>>       just be NULL if discovery was not initialized.
> >>> (Christian/Lijo)
> >>>
> >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already
> >>>       reports errors. (Christian)
> >>>
> >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in
> >>> devcoredump")
> >>> Cc: Pierre-Eric Pelloux-Prayer <[email protected]>
> >>> Cc: Christian König <[email protected]>
> >>> Cc: Alex Deucher <[email protected]>
> >>> Signed-off-by: Srinivasan Shanmugam <[email protected]>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++-
> >>>    1 file changed, 10 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> index f7f37d93d0ce..6be1f971a31a 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct
> >> amdgpu_device *adev)
> >>>      struct list_head *el, *tmp;
> >>>      struct kset *die_kset;
> >>>
> >>> +   if (!ip_top)
> >>> +           return;
> >>> +
> >>>      die_kset = &ip_top->die_kset;
> >>>      spin_lock(&die_kset->list_lock);
> >>>      list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9
> >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev,
> >> struct drm_printer *p)
> >>>      struct ip_hw_instance *ip_inst;
> >>>      int i = 0, j;
> >>>
> >>> +   drm_printf(p, "\nHW IP Discovery\n");
> >>> +
> >>> +   if (!ip_top) {
> >>> +           drm_printf(p, "ip discovery topology unavailable\n");
> >>
> >> Is this type of printing really required or just skipping the whole
> >> section good enough?
> >
> >
> > Silently skipping the rest may look like incomplete or missing data in
> > the coredump.
> >
> > Adding a one-line message makes it clear that the topology was not
> > available, rather than leaving an empty section.
> >
>
> Here is my take - Discovery is the basic requirement for SOCs which make use 
> of
> that mechanism and it is always expected to be present for those, otherwise 
> driver
> load will fail.
>
> For those which don't make use of discovery, the section will not be present. 
> There
> is no special message required for that. There is no harm to keep that, it 
> only adds
> extra parsing.

It's ok for normal cases. However, this runs in the *coredump/debug path
during error conditions* - where the discovery topology may not be
initialized. instead of leaving the section empty, which can be confusing
during debugging.

Best,
Srini

Reply via email to