[Kernel-packages] [Bug 1981883] Re: amdgpu module crash after 5.15 kernel update

Henry Goffin Sat, 16 Jul 2022 03:55:56 -0700

Before the crash, the kernel also emits several warnings with backtraces:

[ 318.108644] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:221 drm_helper_disable_unused_functions+0x32/0x50 [drm_kms_helper]

[ 318.109727] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:101 drm_helper_encoder_in_use+0x4d/0xe0 [drm_kms_helper]

[ 318.110742] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:141 drm_helper_crtc_in_use+0x3c/0xb0 [drm_kms_helper]

All three warnings come from the same check, which fires when a driver
that advertises atomic modesetting calls the legacy modesetting helper
functions. Beyond that summary, I am way out of my depth, and someone
from AMD will probably have to untangle this for 5.15 (later, in 5.17,
this entire custom fbdev implementation was removed and replaced with
common code).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-meta-aws-5.15 in Ubuntu.
https://bugs.launchpad.net/bugs/1981883

Title:
  amdgpu module crash after 5.15 kernel update

Status in linux-meta-aws-5.15 package in Ubuntu:
  New

Bug description:
  The kernel 5.15 amdgpu module crashes on load with a “BUG: kernel NULL
  pointer dereference” on Amazon EC2 G4ad hardware (custom AMD Radeon
  V520 Pro datacenter GPU) on focal (HWE) and jammy with kernel
  5.15.0-1011 and possibly earlier, up through latest (revision
  1015.19). This crash bug did not exist in any of the focal HWE 5.13
  kernels.

  This is probably an upstream kernel bug, but I am also filing it here
  because existing focal users on EC2 will suddenly stop having access
  to their AMD GPUs after a reboot once the new 5.15 HWE kernel is
  installed.

  The full backtrace from dmesg is below. The offending function call
  that crashes in the 5.15 kernel corresponds to this source (sorry,
  not the right source tree, but the same driver):
  
https://github.com/torvalds/linux/blob/8bb7eca972ad531c9b149c0a51ab43a417385813/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c#L345

  A workaround that I have discovered is adding “options amdgpu
  virtual_display=;” to a new modprobe.d configuration file - something
  which shouldn’t be required, but is at least harmless.
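
  For illustration, the workaround could live in a file such as the one
  below. The file name is only an assumption; the option line is the one
  quoted above, unchanged.

```
# /etc/modprobe.d/amdgpu-virtual-display.conf  (hypothetical file name)
# Workaround for the amdgpu load crash on 5.15, as described above.
options amdgpu virtual_display=;
```

  If amdgpu is loaded from the initramfs on a given system, the
  initramfs would presumably need to be regenerated before the option
  takes effect.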

  
  Here is the relevant BUG message and backtrace from dmesg:

  [  318.111721] BUG: kernel NULL pointer dereference, address: 0000000000000000
  [  318.115443] #PF: supervisor instruction fetch in kernel mode
  [  318.118443] #PF: error_code(0x0010) - not-present page
  [  318.121177] PGD 0 P4D 0
  [  318.122688] Oops: 0010 [#1] SMP NOPTI
  [  318.124592] CPU: 6 PID: 13667 Comm: modprobe Tainted: G        W         5.15.0-1015-aws #19~20.04.1-Ubuntu
  [  318.129711] Hardware name: Amazon EC2 g4ad.2xlarge/, BIOS 1.0 10/16/2017
  [  318.133167] RIP: 0010:0x0
  [  318.134704] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
  [  318.138291] RSP: 0018:ffff9841828d78e0 EFLAGS: 00010246
  [  318.140938] RAX: 0000000000000000 RBX: ffff8a4f16ae8000 RCX: 0000000000000001
  [  318.144604] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8a4f16ae8000
  [  318.148319] RBP: ffff9841828d7908 R08: ffff8a4f02460278 R09: ffff8a4f06422c40
  [  318.152151] R10: c01c42494f8affff R11: ffff8a4f01dcb5b8 R12: ffff8a4f024602e8
  [  318.155929] R13: ffffffffc107e4a0 R14: 0000000000000000 R15: ffff8a4f02460010
  [  318.159685] FS:  00007f8afdc1c740(0000) GS:ffff8a55f0980000(0000) knlGS:0000000000000000
  [  318.163897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  318.167038] CR2: ffffffffffffffd6 CR3: 000000011817e000 CR4: 00000000003506e0
  [  318.170758] Call Trace:
  [  318.173964]  <TASK>
  [  318.176822]  __drm_helper_disable_unused_functions+0xe7/0x100 [drm_kms_helper]
  [  318.184230]  drm_helper_disable_unused_functions+0x44/0x50 [drm_kms_helper]
  [  318.189761]  amdgpu_fbdev_init+0x104/0x110 [amdgpu]
  [  318.194264]  amdgpu_device_init.cold+0x7cc/0xc48 [amdgpu]
  [  318.199061]  ? pci_read_config_byte+0x27/0x40
  [  318.203206]  amdgpu_driver_load_kms+0x1e/0x270 [amdgpu]
  [  318.207901]  amdgpu_pci_probe+0x1ea/0x290 [amdgpu]
  [  318.212445]  local_pci_probe+0x4b/0x90
  [  318.216386]  pci_device_probe+0x182/0x1f0
  [  318.220407]  really_probe.part.0+0xcb/0x370
  [  318.224460]  really_probe+0x40/0x80
  [  318.228232]  __driver_probe_device+0x115/0x190
  [  318.232412]  driver_probe_device+0x23/0xa0
  [  318.236436]  __driver_attach+0xbd/0x160
  [  318.240348]  ? __device_attach_driver+0x110/0x110
  [  318.244637]  bus_for_each_dev+0x7e/0xc0
  [  318.248570]  driver_attach+0x1e/0x20
  [  318.252474]  bus_add_driver+0x161/0x200
  [  318.256412]  driver_register+0x74/0xd0
  [  318.260332]  __pci_register_driver+0x68/0x70
  [  318.264496]  amdgpu_init+0x7c/0x1000 [amdgpu]
  [  318.268841]  ? 0xffffffffc146e000
  [  318.272521]  do_one_initcall+0x48/0x1d0
  [  318.276446]  ? __cond_resched+0x19/0x30
  [  318.280378]  ? kmem_cache_alloc_trace+0x15a/0x420
  [  318.284736]  do_init_module+0x52/0x230
  [  318.288644]  load_module+0x1372/0x1600
  [  318.292529]  __do_sys_finit_module+0xbf/0x120
  [  318.296706]  ? __do_sys_finit_module+0xbf/0x120
  [  318.300947]  __x64_sys_finit_module+0x1a/0x20
  [  318.305265]  do_syscall_64+0x5c/0xc0
  [  318.308984]  ? do_syscall_64+0x69/0xc0
  [  318.312841]  ? do_syscall_64+0x69/0xc0
  [  318.316693]  ? do_syscall_64+0x69/0xc0
  [  318.320545]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [  318.324975] RIP: 0033:0x7f8afdd6273d
  [  318.328700] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 23 37 0d 00 f7 d8 64 89 01 48
  [  318.343601] RSP: 002b:00007ffe48e5c8c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  [  318.351197] RAX: ffffffffffffffda RBX: 0000560c79e39260 RCX: 00007f8afdd6273d
  [  318.356678] RDX: 0000000000000000 RSI: 0000560c78e96358 RDI: 000000000000000f
  [  318.362340] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
  [  318.367743] R10: 000000000000000f R11: 0000000000000246 R12: 0000560c78e96358
  [  318.373241] R13: 0000000000000000 R14: 0000560c79e39390 R15: 0000560c79e39260
  [  318.378699]  </TASK>
  [  318.381783] Modules linked in: amdgpu(+) iommu_v2 gpu_sched drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt btrfs blake2b_generic xor zstd_compress raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel psmouse crypto_simd parport_pc input_leds parport cryptd ena serio_raw sch_fq_codel ipmi_devintf ipmi_msghandler msr drm ip_tables x_tables autofs4
  [  318.418449] CR2: 0000000000000000
  [  318.422130] ---[ end trace d6b9efffe55f5322 ]---
  [  318.426391] RIP: 0010:0x0
  [  318.429681] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
  [  318.434994] RSP: 0018:ffff9841828d78e0 EFLAGS: 00010246
  [  318.439489] RAX: 0000000000000000 RBX: ffff8a4f16ae8000 RCX: 0000000000000001
  [  318.444896] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8a4f16ae8000
  [  318.450518] RBP: ffff9841828d7908 R08: ffff8a4f02460278 R09: ffff8a4f06422c40
  [  318.455937] R10: c01c42494f8affff R11: ffff8a4f01dcb5b8 R12: ffff8a4f024602e8
  [  318.461581] R13: ffffffffc107e4a0 R14: 0000000000000000 R15: ffff8a4f02460010
  [  318.466999] FS:  00007f8afdc1c740(0000) GS:ffff8a55f0980000(0000) knlGS:0000000000000000
  [  318.474744] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  318.479525] CR2: ffffffffffffffd6 CR3: 000000011817e000 CR4: 00000000003506e0
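
  As a reading aid for the "#PF: error_code(0x0010)" line in the trace,
  here is a minimal Python sketch that decodes the x86 page-fault error
  code bits. The bit meanings follow the x86 architecture definition;
  the table and function names are made up for this illustration and
  are not part of any kernel tooling.

```python
# Hypothetical helper: decode an x86 page-fault error code into words.
# Each entry is (bit mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "non-write access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable flags for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


if __name__ == "__main__":
    # Error code taken from the "#PF: error_code(0x0010)" line above.
    print(decode_pf_error_code(0x0010))
```

  For 0x0010 this gives a supervisor-mode instruction fetch from a
  not-present page, matching the two "#PF:" lines. Together with
  "RIP: 0010:0x0", that is consistent with a call through a NULL
  function pointer inside __drm_helper_disable_unused_functions,
  although the trace alone does not show which callback was NULL.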

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-5.15/+bug/1981883/+subscriptions


