[AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Lazar, Lijo <[email protected]> > Sent: Thursday, March 19, 2026 9:46 AM > To: SHANMUGAM, SRINIVASAN <[email protected]>; > Koenig, Christian <[email protected]>; Deucher, Alexander > <[email protected]> > Cc: [email protected]; Pelloux-Prayer, Pierre-Eric <Pierre- > [email protected]> > Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery > topology coredump path v3 > > > > On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > >> -----Original Message----- > >> From: Lazar, Lijo <[email protected]> > >> Sent: Wednesday, March 18, 2026 4:28 PM > >> To: SHANMUGAM, SRINIVASAN <[email protected]>; > Koenig, > >> Christian <[email protected]>; Deucher, Alexander > >> <[email protected]> > >> Cc: [email protected]; Pelloux-Prayer, Pierre-Eric > >> <Pierre- [email protected]> > >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in > >> discovery topology coredump path v3 > >> > >> > >> > >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: > >>> When a GPU fault or timeout happens, the driver creates a > >>> devcoredump to collect debug information. > >>> > >>> During this, amdgpu_devcoredump_format() calls > >>> amdgpu_discovery_dump() to print IP discovery data. > >>> > >>> amdgpu_discovery_dump() uses: > >>> adev->discovery.ip_top > >>> > >>> and then accesses: > >>> ip_top->die_kset > >>> > >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top > >>> may be NULL if the discovery topology was never initialized. > >>> > >>> The current code does not check for this before using ip_top. As a > >>> result, when ip_top is NULL, the coredump worker crashes while > >>> taking the spinlock for ip_top->die_kset. > >>> > >>> Fix this by checking for a missing ip_top before walking the > >>> discovery topology. If it is unavailable, print a short message in > >>> the dump and return safely. > >>> > >>> - If ip_top is NULL, print a message and skip the dump > >>> - Also add the same check in the cleanup path > >>> > >>> This makes the coredump and cleanup paths safe even when the > >>> discovery topology is not available. > >>> > >>> KASAN trace: > >>> [ 522.228252] [IGT] amd_deadlock: starting subtest > >>> amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting > >>> dynamic subtest amdgpu-deadlock-sdma > >>> > >>> ... > >>> > >>> [ 522.952317] Write of size 4 at addr 0000000000000050 by task > >>> kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in > >>> _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound > >>> amdgpu_devcoredump_deferred_work [amdgpu] > >>> > >>> ... > >>> > >>> [ 522.969445] Call Trace: > >>> [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? > >>> __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] > >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? > >>> pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] > >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? > >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? > >>> psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? > >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? > >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? > >>> __schedule+0x113c/0x38d0 [ 522.975381] > >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > >>> > >>> v2: Updated commit message - Clarified that ip_top is not freed, it can > >>> just be NULL if discovery was not initialized. > >>> (Christian/Lijo) > >>> > >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > >>> reports errors. (Christian) > >>> > >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in > >>> devcoredump") > >>> Cc: Pierre-Eric Pelloux-Prayer <[email protected]> > >>> Cc: Christian König <[email protected]> > >>> Cc: Alex Deucher <[email protected]> > >>> Signed-off-by: Srinivasan Shanmugam <[email protected]> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > >>> 1 file changed, 10 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> index f7f37d93d0ce..6be1f971a31a 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct > >> amdgpu_device *adev) > >>> struct list_head *el, *tmp; > >>> struct kset *die_kset; > >>> > >>> + if (!ip_top) > >>> + return; > >>> + > >>> die_kset = &ip_top->die_kset; > >>> spin_lock(&die_kset->list_lock); > >>> list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 > >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, > >> struct drm_printer *p) > >>> struct ip_hw_instance *ip_inst; > >>> int i = 0, j; > >>> > >>> + drm_printf(p, "\nHW IP Discovery\n"); > >>> + > >>> + if (!ip_top) { > >>> + drm_printf(p, "ip discovery topology unavailable\n"); > >> > >> Is this type of printing really required or just skipping the whole > >> section good enough? > > > > > > Silently skipping the rest may look like incomplete or missing data in > > the coredump. > > > > Adding a one-line message makes it clear that the topology was not > > available, rather than leaving an empty section. > > > > Here is my take - Discovery is the basic requirement for SOCs which make use > of > that mechanism and it is always expected to be present for those, otherwise > driver > load will fail. > > For those which don't make use of discovery, the section will not be present. > There > is no special message required for that. There is no harm to keep that, it > only adds > extra parsing.
It's ok for normal cases. However, this runs in the *coredump/debug path during error conditions* - where the discovery topology may not be initialized. instead of leaving the section empty, which can be confusing during debugging. Best, Srini
