[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Tuesday, March 24, 2026 2:38 PM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>
> Subject: Re: [PATCH v4] drm/amdgpu: guard atom_context in devcoredump VBIOS
> dump
>
>
>
> On 24-Mar-26 6:56 AM, Jesse.Zhang wrote:
> > During GPU reset coredump generation, amdgpu_devcoredump_fw_info()
> > unconditionally dereferences adev->mode_info.atom_context to print
> > VBIOS fields. On reset/teardown paths this pointer can be NULL,
> > causing a kernel page fault from the deferred coredump workqueue.
> >
> > Fix by checking ctx before printing VBIOS fields:
> >
> > if ctx is valid, print full VBIOS information as before; This prevents
> > NULL-dereference crashes while preserving coredump output.
> >
> > Observed page fault log:
> > [  667.933329] RIP: 0010:amdgpu_devcoredump_format+0x780/0xc00
> > [amdgpu] [  667.941517] amdgpu 0002:01:00.0: Dumping IP State [
> > 667.949660] Code: 8d 57 74 48 c7 c6 01 65 9f c2 48 8d 7d 98 e8 97 96
> > 7a ff 49 8d 97 b4 00 00 00 48 c7 c6 18 65 9f c2 48 8d 7d 98 e8 80 96
> > 7a ff <41> 8b 97 f4 00 00 00 48 c7 c6 2f 65 9f c2 48 8d 7d 98 e8 69 96
> > 7a [  667.949666] RSP: 0018:ffffc9002302bd50 EFLAGS: 00010246 [
> > 667.949673] RAX: 0000000000000000 RBX: ffff888110600000 RCX:
> > 0000000000000000 [  667.949676] RDX: 000000000000a9b5 RSI:
> > 0000000000000405 RDI: 000000000000a999 [  667.949680] RBP:
> > ffffc9002302be00 R08: ffffffffc09c3084 R09: ffffffffc09c3085 [
> > 667.949684] R10: 0000000000000000 R11: 0000000000000004 R12:
> > 00000000000048e0 [  667.993908] amdgpu 0002:01:00.0: Dumping IP State
> > Completed [  667.994229] R13: 0000000000000025 R14: 000000000000000c
> > R15: 0000000000000000 [  667.994233] FS:  0000000000000000(0000)
> > GS:ffff88c44c2c9000(0000) knlGS:0000000000000000 [  668.000076] amdgpu
> > 0002:01:00.0: [drm] AMDGPU device coredump file has been created
> [  668.008025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  668.008030] CR2: 00000000000000f4 CR3: 000000011195f001 CR4:
> 0000000000770ef0 [  668.008035] PKRU: 55555554 [  668.008040] Call Trace:
> > [  668.008045]  <TASK>
> > [  668.016010] amdgpu 0002:01:00.0: [drm] Check your
> > /sys/class/drm/card16/device/devcoredump/data
> > [  668.023967]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  668.023988]  ? __pfx___drm_printfn_coredump+0x10/0x10 [drm] [
> > 668.031950] amdgpu 0003:01:00.0: Dumping IP State [  668.038159]  ?
> > __pfx___drm_puts_coredump+0x10/0x10 [drm] [  668.083017] amdgpu
> > 0003:01:00.0: Dumping IP State Completed [  668.083824]
> > amdgpu_devcoredump_deferred_work+0x26/0xc0 [amdgpu] [  668.086163]
> > amdgpu 0003:01:00.0: [drm] AMDGPU device coredump file has been
> > created [  668.095863]  process_scheduled_works+0xa6/0x420
> > [  668.095880]  worker_thread+0x12a/0x270 [  668.101223] amdgpu
> > 0003:01:00.0: [drm] Check your
> > /sys/class/drm/card24/device/devcoredump/data
> > [  668.107441]  kthread+0x10d/0x230
> > [  668.107451]  ? __pfx_worker_thread+0x10/0x10 [  668.107458]  ?
> > __pfx_kthread+0x10/0x10 [  668.112709] amdgpu 0000:01:00.0: ring
> > vcn_unified_1 timeout, signaled seq=9, emitted seq=10 [  668.118630]
> > ret_from_fork+0x17c/0x1f0 [  668.118640]  ? __pfx_kthread+0x10/0x10 [
> > 668.118647]  ret_from_fork_asm+0x1a/0x30
> >
> > v4: fix the race concern without introducing VBIOS snapshot state.
> >
> > Suggested-by: Lijo Lazar <[email protected]>
> > Signed-off-by: Jesse Zhang <[email protected]>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 16 ++++++++++------
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c       |  4 ++++
> >   2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > index bbb5afd67b49..5aa46480f05f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > @@ -192,12 +192,16 @@ static void amdgpu_devcoredump_fw_info(struct
> amdgpu_device *adev,
> >     drm_printf(p, "VPE feature version: %u, fw version: 0x%08x\n",
> >                adev->vpe.feature_version, adev->vpe.fw_version);
> >
> > -   drm_printf(p, "\nVBIOS Information\n");
> > -   drm_printf(p, "vbios name       : %s\n", ctx->name);
> > -   drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
> > -   drm_printf(p, "vbios version    : %d\n", ctx->version);
> > -   drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
> > -   drm_printf(p, "vbios date       : %s\n", ctx->date);
> > +   if (adev->bios) {
> > +           drm_printf(p, "\nVBIOS Information\n");
> > +           drm_printf(p, "vbios name       : %s\n", ctx->name);
> > +           drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
> > +           drm_printf(p, "vbios version    : %d\n", ctx->version);
> > +           drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
> > +           drm_printf(p, "vbios date       : %s\n", ctx->date);
> > +   }else {
> > +           drm_printf(p, "\nVBIOS Information: NA\n");
> > +   }
> >   }
> >
> >   static ssize_t
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index fbe553c38583..69f4549e6271 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4260,6 +4260,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device
> *adev)
> >     /* free i2c buses */
> >     amdgpu_i2c_fini(adev);
> >
> > +#ifdef CONFIG_DEV_COREDUMP
> > +   /* Make sure deferred coredump formatting is done before tearing down
> VBIOS/ATOM. */
> > +   flush_work(&adev->coredump_work);
>
> Looking further into coredump, it also prints different IP states. Then it 
> needs to be
> finished in amdgpu_device_fini_hw before hw state tear-down.
>
> BTW, suggest to keep the flush work as a separate patch since that is more 
> related
> to proper handling of a deferred coredump work during unload. No-vbios 
> problem is
> different from that and it can be a different patch.

Thanks Lijo  for the review and suggestion.

Agreed. I will split this into two patches:

1.NULL-check fix in amdgpu_devcoredump_fw_info() only, to avoid dereferencing 
adev->mode_info.atom_context when it is NULL.
2. Deferred coredump work ordering fix as a separate patch, moving the 
flush_work(&adev->coredump_work) handling to amdgpu_device_fini_hw() before 
HW/IP teardown.

Thanks
Jesse
>
> Thanks,
> Lijo
>
> > +#endif
> >     if (adev->bios) {
> >             if (amdgpu_emu_mode != 1)
> >                     amdgpu_atombios_fini(adev);

Reply via email to