[AMD Official Use Only - AMD Internal Distribution Only]

Hi Mike,

Thanks for the details. We tried to repro the issue at our end on 9000 and 7000 series dGPUs, but we're not seeing the dmub errors. We were on Ubuntu, so we'll try Fedora.

--

Regards,
Jay
________________________________
From: Mikhail Gavrilov <[email protected]>
Sent: Wednesday, May 21, 2025 12:40 PM
To: Pillai, Aurabindo <[email protected]>
Cc: Chung, ChiaHsuan (Tom) <[email protected]>; Wu, Ray <[email protected]>; Wheeler, Daniel <[email protected]>; Deucher, Alexander <[email protected]>; amd-gfx list <[email protected]>; dri-devel <[email protected]>; Linux List Kernel Mailing <[email protected]>; Linux regressions mailing list <[email protected]>
Subject: Re: 6.15-rc6/regression/bisected - after commit f1c6be3999d2 error appeared: *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error

On Tue, May 20, 2025 at 9:22 PM Mikhail Gavrilov <[email protected]> wrote:
>
> > Could you more details about your setup, and how you were able to repro it ?
> >

Hi,
Were you able to reproduce the issue?

I’ve prepared a step-by-step guide that may help:

1. Set up a system with a Radeon 6900XT and an LG TV connected via HDMI.
2. Install Fedora Rawhide.
3. Build and install kernel 6.15-rc7 using my .config (attached in the first message).
4. Boot into the custom-built kernel.
5. Set the display resolution to 3840×2160 @ 120 Hz. (This step is optional but may help trigger the issue faster.)
6. Generate heavy system load. I use an infinite kernel rebuild loop:

<fish shell>
> for i in (seq 1 400000); make clean && make -j32 bzImage && make -j32 modules; end
</fish shell>

Expected behavior:
System remains stable during heavy load.

Actual behavior:
1. First, the kernel log is filled with repeated messages:
   amdgpu 0000:03:00.0: amdgpu: [drm] DP AUX transfer fail:4
2. After a short while under load, more severe errors appear:
   amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
3.
Finally, the system completely freezes with a hard lockup:

watchdog: CPU28: Watchdog detected hard LOCKUP on cpu 28
Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_queue
 nfnetlink_queue nf_conntrack_netbios_ns nf_conntrack_broadcast
 nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
 nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
 nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep
 sunrpc binfmt_misc amd_atl intel_rapl_msr intel_rapl_common
 edac_mce_amd btusb snd_hda_codec_realtek btrtl mt7921e btintel
 mt7921_common snd_hda_codec_generic btbcm mt792x_lib btmtk
 snd_hda_scodec_component snd_hda_codec_hdmi snd_usb_audio
 mt76_connac_lib kvm_amd snd_hda_intel bluetooth mt76 snd_intel_dspcfg
 snd_usbmidi_lib snd_intel_sdw_acpi snd_hda_codec mc kvm spd5118
 mac80211 snd_ump snd_hda_core snd_rawmidi snd_hwdep vfat irqbypass fat
 snd_seq snd_seq_device wmi_bmof libarc4 rapl r8169 pcspkr snd_pcm
 cfg80211 i2c_piix4 snd_timer k10temp i2c_smbus realtek snd rfkill
 joydev soundcore gpio_amdpt gpio_generic loop nfnetlink zram
 lz4hc_compress lz4_compress amdgpu amdxcp i2c_algo_bit drm_ttm_helper
 ttm drm_exec nvme polyval_clmulni gpu_sched polyval_generic
 ghash_clmulni_intel drm_suballoc_helper ucsi_ccg nvme_core
 sha512_ssse3 typec_ucsi drm_panel_backlight_quirks sha256_ssse3
 drm_buddy nvme_keyring typec sha1_ssse3 sp5100_tco nvme_auth
 drm_display_helper cec video wmi fuse
irq event stamp: 117172
hardirqs last enabled at (117171): [<ffffffff9e001566>] asm_common_interrupt+0x26/0x40
hardirqs last disabled at (117172): [<ffffffffa1c00f97>] irqentry_enter+0x57/0x60
softirqs last enabled at (117144): [<ffffffff9e614919>] handle_softirqs+0x579/0x840
softirqs last disabled at (117137): [<ffffffff9e614d16>] __irq_exit_rcu+0x126/0x240
CPU: 28 UID: 1000 PID: 1737394 Comm: as Tainted: G W L ------ --- 6.15.0-0.rc6.250515g088d13246a46.54.fc43.x86_64+debug #1 PREEMPT(lazy)
Tainted: [W]=WARN, [L]=SOFTLOCKUP
Hardware name: ASRock B650I Lightning WiFi/B650I Lightning WiFi, BIOS 3.08 09/18/2024
RIP: 0010:delay_halt_mwaitx+0x20/0x50
Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 65 48 8b 05 56 13 0d 04 31 d2 48 89 d1 48 05 00 00 ca a5 0f 01 fa <b8> ff ff ff ff b9 02 00 00 00 48 39 c6 48 0f 47 f0 b8 f0 00 00 00
RSP: 0000:ffffc9003b68f820 EFLAGS: 00000087
RAX: ffff888fda610000 RBX: 000000000000118c RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000118c RDI: 000023b4c02956f6
RBP: 000023b4c02956f6 R08: ffffffffc14b01a9 R09: fffffbfff49570d4
R10: 000000000000001c R11: 0000000000002000 R12: ffffed1040583d43
R13: ffffed1040583d17 R14: 00000000000186a0 R15: ffff888202c1e800
FS: 00007f1da07bcd00(0000) GS:ffff889034970000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1da0a43930 CR3: 00000003e594a000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 delay_halt.part.0+0x33/0x60
 dmub_srv_wait_for_idle+0x12f/0x1d0 [amdgpu]
 dc_dmub_srv_cmd_run_list+0x99/0x2a0 [amdgpu]
 dc_dmub_srv_drr_update_cmd+0x158/0x340 [amdgpu]
 ? __lock_acquire+0x40f/0x1160
 ? __pfx_dc_dmub_srv_drr_update_cmd+0x10/0x10 [amdgpu]
 ? lock_acquire.part.0+0xc8/0x270
 ? local_clock_noinstr+0xf/0x130
 optc1_set_drr+0x18b/0xf20 [amdgpu]
 ? rcu_is_watching+0x15/0xe0
 set_drr_and_clear_adjust_pending+0xa6/0x180 [amdgpu]
 ? __lock_acquire+0x40f/0x1160
 dcn10_set_drr+0x224/0x390 [amdgpu]
 ? __pfx_dcn10_set_drr+0x10/0x10 [amdgpu]
 ? local_clock+0x15/0x30
 ? __lock_release.isra.0+0x1cb/0x340
 ? rcu_is_watching+0x15/0xe0
 dc_stream_adjust_vmin_vmax+0x4d9/0xd60 [amdgpu]
 ? __pfx_dc_stream_adjust_vmin_vmax+0x10/0x10 [amdgpu]
 ? dm_crtc_high_irq+0x4c8/0xb70 [amdgpu]
 ? __raw_spin_lock_irqsave+0x60/0x90
 dm_crtc_high_irq+0x7b5/0xb70 [amdgpu]
 ? amdgpu_dm_irq_handler+0xf3/0x2a0 [amdgpu]
 amdgpu_dm_irq_handler+0x19a/0x2a0 [amdgpu]
 amdgpu_irq_dispatch+0x286/0x670 [amdgpu]
 ? find_held_lock+0x2b/0x80
 ? __pfx_amdgpu_irq_dispatch+0x10/0x10 [amdgpu]
 ? __pfx___drm_dev_dbg+0x10/0x10
 ? do_raw_spin_unlock+0x59/0x230
 ? __wake_up+0x44/0x60
 amdgpu_ih_process+0x1c4/0x3a0 [amdgpu]
 ? __pfx_amdgpu_irq_handler+0x10/0x10 [amdgpu]
 amdgpu_irq_handler+0x27/0xb0 [amdgpu]
 ? __pfx_amdgpu_irq_handler+0x10/0x10 [amdgpu]
 __handle_irq_event_percpu+0x1b5/0x510
 handle_irq_event+0xab/0x1c0
 handle_edge_irq+0x213/0xb50
 __common_interrupt+0xad/0x1d0
 ? irq_enter_rcu+0x26/0x190
 common_interrupt+0x5a/0xe0
 asm_common_interrupt+0x26/0x40
RIP: 0033:0x5639d9e740ee
Code: 45 c8 85 d2 74 04 41 80 08 04 48 83 c4 58 4c 89 c8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 8b 57 10 44 8b 15 fd bd 08 00 4c 03 0a <45> 85 d2 0f 84 33 ff ff ff 83 c8 04 4c 89 4f 20 88 07 4c 89 c8 c3
RSP: 002b:00007ffce2334e38 EFLAGS: 00000202
RAX: 0000000000000001 RBX: 00007f1d91572388 RCX: 0000000000000002
RDX: 00007f1d90e9c750 RSI: 0000000000000000 RDI: 00007f1d912e5d20
RBP: 00007ffce2334e40 R08: 00007f1d912e5d20 R09: 000000000000e119
R10: 0000000000000001 R11: 0000000000000002 R12: 00007f1d9157f008
R13: 0000000000000000 R14: 00005639d9e74f90 R15: 00007f1da06fe730
 </TASK>
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 5.441 msecs

Environment:
  GPU: AMD Radeon 6900XT
  Display: LG TV via HDMI
  Kernel: 6.15-rc7, built from source using provided config
  Distro: Fedora Rawhide
  Motherboard: ASRock B650I Lightning WiFi
  BIOS: 3.08 (2024-09-18)

Additional diagnostic info:
  Full kernel log ending with stack trace from delay_halt_mwaitx()
  Series of dc_dmub_srv_drr_update_cmd() and dc_stream_adjust_vmin_vmax() calls in call trace
  System enters an unrecoverable lockup state after a few minutes of heavy compilation

--
Best Regards,
Mike Gavrilov.
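
For anyone comparing logs across distros or GPUs, the three failure signatures reported above (the "DP AUX transfer fail" flood, the DMCUB error, and the hard LOCKUP) could be counted in a saved kernel log with a small script. This is only a rough sketch: the stage names (aux_fail, dmcub_error, hard_lockup) and the file name below are made up for illustration and are not amdgpu terminology; the match strings are copied from the messages quoted in this report.

```python
import re
import sys
from collections import Counter

# Match strings are taken from the log lines quoted in the report.
# The stage names are hypothetical labels chosen for this sketch.
STAGES = [
    ("aux_fail", re.compile(r"DP AUX transfer fail")),
    ("dmcub_error", re.compile(r"dc_dmub_srv_log_diagnostic_data: DMCUB error")),
    ("hard_lockup", re.compile(r"Watchdog detected hard LOCKUP")),
]


def classify(line):
    """Return the stage name for a log line, or None if it matches no stage."""
    for name, pattern in STAGES:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count matches per stage and record the order in which stages first appear."""
    counts = Counter()
    first_seen = []
    for line in lines:
        stage = classify(line)
        if stage is None:
            continue
        if stage not in counts:
            first_seen.append(stage)
        counts[stage] += 1
    return counts, first_seen


if __name__ == "__main__":
    # Reads a kernel log from stdin and prints one line per stage seen.
    counts, order = summarize(sys.stdin)
    for stage in order:
        print(f"{stage}: {counts[stage]}")
```

Since the machine locks up hard, the log of the previous boot is the interesting one, for example `journalctl -k -b -1 | python3 dmub_stages.py` (the script name is arbitrary). If the order printed matches aux_fail, dmcub_error, hard_lockup, the run followed the same progression described in this report.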
