On 3/24/26 02:26, Jesse.Zhang wrote:
> During GPU reset coredump generation, amdgpu_devcoredump_fw_info()
> unconditionally
> dereferences adev->mode_info.atom_context to print VBIOS fields. On
> reset/teardown
> paths this pointer can be NULL, causing a kernel page fault from the deferred
> coredump workqueue.
>
> Fix by checking ctx before printing VBIOS fields:
>
> if ctx is valid, print full VBIOS information as before;
> This prevents NULL-dereference crashes while preserving coredump output.
>
> Observed page fault log:
> [ 667.933329] RIP: 0010:amdgpu_devcoredump_format+0x780/0xc00 [amdgpu]
> [ 667.941517] amdgpu 0002:01:00.0: Dumping IP State
> [ 667.949660] Code: 8d 57 74 48 c7 c6 01 65 9f c2 48 8d 7d 98 e8 97 96 7a ff
> 49 8d 97 b4 00 00 00 48 c7 c6 18 65 9f c2 48 8d 7d 98 e8 80 96 7a ff <41> 8b
> 97 f4 00 00 00 48 c7 c6 2f 65 9f c2 48 8d 7d 98 e8 69 96 7a
> [ 667.949666] RSP: 0018:ffffc9002302bd50 EFLAGS: 00010246
> [ 667.949673] RAX: 0000000000000000 RBX: ffff888110600000 RCX:
> 0000000000000000
> [ 667.949676] RDX: 000000000000a9b5 RSI: 0000000000000405 RDI:
> 000000000000a999
> [ 667.949680] RBP: ffffc9002302be00 R08: ffffffffc09c3084 R09:
> ffffffffc09c3085
> [ 667.949684] R10: 0000000000000000 R11: 0000000000000004 R12:
> 00000000000048e0
> [ 667.993908] amdgpu 0002:01:00.0: Dumping IP State Completed
> [ 667.994229] R13: 0000000000000025 R14: 000000000000000c R15:
> 0000000000000000
> [ 667.994233] FS: 0000000000000000(0000) GS:ffff88c44c2c9000(0000)
> knlGS:0000000000000000
> [ 668.000076] amdgpu 0002:01:00.0: [drm] AMDGPU device coredump file has
> been created
> [ 668.008025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 668.008030] CR2: 00000000000000f4 CR3: 000000011195f001 CR4:
> 0000000000770ef0
> [ 668.008035] PKRU: 55555554
> [ 668.008040] Call Trace:
> [ 668.008045] <TASK>
> [ 668.016010] amdgpu 0002:01:00.0: [drm] Check your
> /sys/class/drm/card16/device/devcoredump/data
> [ 668.023967] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 668.023988] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm]
> [ 668.031950] amdgpu 0003:01:00.0: Dumping IP State
> [ 668.038159] ? __pfx___drm_puts_coredump+0x10/0x10 [drm]
> [ 668.083017] amdgpu 0003:01:00.0: Dumping IP State Completed
> [ 668.083824] amdgpu_devcoredump_deferred_work+0x26/0xc0 [amdgpu]
> [ 668.086163] amdgpu 0003:01:00.0: [drm] AMDGPU device coredump file has
> been created
> [ 668.095863] process_scheduled_works+0xa6/0x420
> [ 668.095880] worker_thread+0x12a/0x270
> [ 668.101223] amdgpu 0003:01:00.0: [drm] Check your
> /sys/class/drm/card24/device/devcoredump/data
> [ 668.107441] kthread+0x10d/0x230
> [ 668.107451] ? __pfx_worker_thread+0x10/0x10
> [ 668.107458] ? __pfx_kthread+0x10/0x10
> [ 668.112709] amdgpu 0000:01:00.0: ring vcn_unified_1 timeout, signaled
> seq=9, emitted seq=10
> [ 668.118630] ret_from_fork+0x17c/0x1f0
> [ 668.118640] ? __pfx_kthread+0x10/0x10
> [ 668.118647] ret_from_fork_asm+0x1a/0x30
>
> v4: fix the race concern without introducing VBIOS snapshot state.
>
> Suggested-by: Lijo Lazar <[email protected]>
> Signed-off-by: Jesse Zhang <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 16 ++++++++++------
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> index bbb5afd67b49..5aa46480f05f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> @@ -192,12 +192,16 @@ static void amdgpu_devcoredump_fw_info(struct
> amdgpu_device *adev,
> drm_printf(p, "VPE feature version: %u, fw version: 0x%08x\n",
> adev->vpe.feature_version, adev->vpe.fw_version);
>
> - drm_printf(p, "\nVBIOS Information\n");
> - drm_printf(p, "vbios name : %s\n", ctx->name);
> - drm_printf(p, "vbios pn : %s\n", ctx->vbios_pn);
> - drm_printf(p, "vbios version : %d\n", ctx->version);
> - drm_printf(p, "vbios ver_str : %s\n", ctx->vbios_ver_str);
> - drm_printf(p, "vbios date : %s\n", ctx->date);
> + if (adev->bios) {
> + drm_printf(p, "\nVBIOS Information\n");
> + drm_printf(p, "vbios name : %s\n", ctx->name);
> + drm_printf(p, "vbios pn : %s\n", ctx->vbios_pn);
> + drm_printf(p, "vbios version : %d\n", ctx->version);
> + drm_printf(p, "vbios ver_str : %s\n", ctx->vbios_ver_str);
> + drm_printf(p, "vbios date : %s\n", ctx->date);
> + }else {
> + drm_printf(p, "\nVBIOS Information: NA\n");
> + }
> }
>
> static ssize_t
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index fbe553c38583..69f4549e6271 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4260,6 +4260,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> /* free i2c buses */
> amdgpu_i2c_fini(adev);
>
> +#ifdef CONFIG_DEV_COREDUMP
> + /* Make sure deferred coredump formatting is done before tearing down
> VBIOS/ATOM. */
> + flush_work(&adev->coredump_work);
> +#endif
Looks correct to me of hand, but I'm not very familiar with this part of the
code.
@Pierre-Eric and @Sunil can you take a look as well? You two have done more
with devcoredump then me.
Thanks,
Christian.
> if (adev->bios) {
> if (amdgpu_emu_mode != 1)
> amdgpu_atombios_fini(adev);