Le 24/03/2026 à 13:38, Christian König a écrit :
On 3/24/26 11:28, Lazar, Lijo wrote:


On 24-Mar-26 3:43 PM, Pierre-Eric Pelloux-Prayer wrote:


Le 24/03/2026 à 10:42, Christian König a écrit :
On 3/24/26 02:26, Jesse.Zhang wrote:
During GPU reset coredump generation, amdgpu_devcoredump_fw_info() 
unconditionally
dereferences adev->mode_info.atom_context to print VBIOS fields. On 
reset/teardown
paths this pointer can be NULL, causing a kernel page fault from the deferred
coredump workqueue.

Fix by checking ctx before printing VBIOS fields:

if ctx is valid, print full VBIOS information as before;
This prevents NULL-dereference crashes while preserving coredump output.

Observed page fault log:
[  667.933329] RIP: 0010:amdgpu_devcoredump_format+0x780/0xc00 [amdgpu]
[  667.941517] amdgpu 0002:01:00.0: Dumping IP State
[  667.949660] Code: 8d 57 74 48 c7 c6 01 65 9f c2 48 8d 7d 98 e8 97 96 7a ff 49 8d 
97 b4 00 00 00 48 c7 c6 18 65 9f c2 48 8d 7d 98 e8 80 96 7a ff <41> 8b 97 f4 00 
00 00 48 c7 c6 2f 65 9f c2 48 8d 7d 98 e8 69 96 7a
[  667.949666] RSP: 0018:ffffc9002302bd50 EFLAGS: 00010246
[  667.949673] RAX: 0000000000000000 RBX: ffff888110600000 RCX: 0000000000000000
[  667.949676] RDX: 000000000000a9b5 RSI: 0000000000000405 RDI: 000000000000a999
[  667.949680] RBP: ffffc9002302be00 R08: ffffffffc09c3084 R09: ffffffffc09c3085
[  667.949684] R10: 0000000000000000 R11: 0000000000000004 R12: 00000000000048e0
[  667.993908] amdgpu 0002:01:00.0: Dumping IP State Completed
[  667.994229] R13: 0000000000000025 R14: 000000000000000c R15: 0000000000000000
[  667.994233] FS:  0000000000000000(0000) GS:ffff88c44c2c9000(0000) 
knlGS:0000000000000000
[  668.000076] amdgpu 0002:01:00.0: [drm] AMDGPU device coredump file has been 
created
[  668.008025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  668.008030] CR2: 00000000000000f4 CR3: 000000011195f001 CR4: 0000000000770ef0
[  668.008035] PKRU: 55555554
[  668.008040] Call Trace:
[  668.008045]  <TASK>
[  668.016010] amdgpu 0002:01:00.0: [drm] Check your /sys/class/drm/ 
card16/device/devcoredump/data
[  668.023967]  ? srso_alias_return_thunk+0x5/0xfbef5
[  668.023988]  ? __pfx___drm_printfn_coredump+0x10/0x10 [drm]
[  668.031950] amdgpu 0003:01:00.0: Dumping IP State
[  668.038159]  ? __pfx___drm_puts_coredump+0x10/0x10 [drm]
[  668.083017] amdgpu 0003:01:00.0: Dumping IP State Completed
[  668.083824]  amdgpu_devcoredump_deferred_work+0x26/0xc0 [amdgpu]
[  668.086163] amdgpu 0003:01:00.0: [drm] AMDGPU device coredump file has been 
created
[  668.095863]  process_scheduled_works+0xa6/0x420
[  668.095880]  worker_thread+0x12a/0x270
[  668.101223] amdgpu 0003:01:00.0: [drm] Check your /sys/class/drm/ 
card24/device/devcoredump/data
[  668.107441]  kthread+0x10d/0x230
[  668.107451]  ? __pfx_worker_thread+0x10/0x10
[  668.107458]  ? __pfx_kthread+0x10/0x10
[  668.112709] amdgpu 0000:01:00.0: ring vcn_unified_1 timeout, signaled seq=9, 
emitted seq=10
[  668.118630]  ret_from_fork+0x17c/0x1f0
[  668.118640]  ? __pfx_kthread+0x10/0x10
[  668.118647]  ret_from_fork_asm+0x1a/0x30

v4: fix the race concern without introducing VBIOS snapshot state.

Suggested-by: Lijo Lazar <[email protected]>
Signed-off-by: Jesse Zhang <[email protected]>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 16 ++++++++++------
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c       |  4 ++++
   2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/ 
drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index bbb5afd67b49..5aa46480f05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -192,12 +192,16 @@ static void amdgpu_devcoredump_fw_info(struct 
amdgpu_device *adev,
       drm_printf(p, "VPE feature version: %u, fw version: 0x%08x\n",
              adev->vpe.feature_version, adev->vpe.fw_version);
-    drm_printf(p, "\nVBIOS Information\n");
-    drm_printf(p, "vbios name       : %s\n", ctx->name);
-    drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
-    drm_printf(p, "vbios version    : %d\n", ctx->version);
-    drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
-    drm_printf(p, "vbios date       : %s\n", ctx->date);
+    if (adev->bios) {
+        drm_printf(p, "\nVBIOS Information\n");
+        drm_printf(p, "vbios name       : %s\n", ctx->name);
+        drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
+        drm_printf(p, "vbios version    : %d\n", ctx->version);
+        drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
+        drm_printf(p, "vbios date       : %s\n", ctx->date);
+    }else {
+        drm_printf(p, "\nVBIOS Information: NA\n");
+    }
   }
   static ssize_t
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/ 
gpu/drm/amd/amdgpu/amdgpu_device.c
index fbe553c38583..69f4549e6271 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4260,6 +4260,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
       /* free i2c buses */
       amdgpu_i2c_fini(adev);
+#ifdef CONFIG_DEV_COREDUMP
+    /* Make sure deferred coredump formatting is done before tearing down 
VBIOS/ATOM. */
+    flush_work(&adev->coredump_work);
+#endif

Looks correct to me of hand, but I'm not very familiar with this part of the 
code.

@Pierre-Eric and @Sunil can you take a look as well? You two have done more 
with devcoredump then me.

The worker thread doesn't access the HW but it still reads some things from the 
adev pointer.
Ideally, anything the worker needs should be copied to amdgpu_coredump_info 
from amdgpu_coredump.
Then the worker would only ever access its own state.


Hi Pierre,

amdgpu_devcoredump_deferred_work -> amdgpu_devcoredump_format() -> 
ip_block->version->funcs->print_ip_state(ip_block, &p);

This accesses hw state.

A newer version of the patch is posted which moves flush work inside fini hw 
befor IP tear down.

That sounds reasonable to me.

+1


I mean we don't need to copy static information into amdgpu_coredump_info (e.g. 
like BIOS info), but things like print_ip_state() clearly needs to execute 
outside of the worker since the dumped HW state is destroyed by the GPU reset.

Reading hw state from the worker causes potentially outdated data to be read.

Previously this was done in amdgpu_decoredump_read which was called when someone would read the devcoredump - this was wrong for the same reasons.

Doing it from amdgpu_coredump requires memory to be pre-allocated to store each IP' state because we can't allocate memory in this context. Then the stored IP state would be formatted by the worker.

Pierre-Eric


Regards,
Christian.


Thanks,
Lijo

That being said, flushing the worker is fine for now.

Pierre-Eric


Thanks,
Christian.


       if (adev->bios) {
           if (amdgpu_emu_mode != 1)
               amdgpu_atombios_fini(adev);

Reply via email to