bisected - after commit f1c6be3999d2 error appeared: ERROR dc_dmub_srv_log_diagnostic_data: DMCUB error

Pillai, Aurabindo Wed, 21 May 2025 10:13:54 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Mike,


Thanks for the details. We tried to repro the issue at our end on 9000 and 7000
series dGPUs, but we're not seeing the dmub errors. We were on Ubuntu, so we'll
try Fedora.

--

Regards,
Jay
________________________________
From: Mikhail Gavrilov <[email protected]>
Sent: Wednesday, May 21, 2025 12:40 PM
To: Pillai, Aurabindo <[email protected]>
Cc: Chung, ChiaHsuan (Tom) <[email protected]>; Wu, Ray <[email protected]>; 
Wheeler, Daniel <[email protected]>; Deucher, Alexander 
<[email protected]>; amd-gfx list <[email protected]>; 
dri-devel <[email protected]>; Linux List Kernel Mailing 
<[email protected]>; Linux regressions mailing list 
<[email protected]>
Subject: Re: 6.15-rc6/regression/bisected - after commit f1c6be3999d2 error 
appeared: *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error

On Tue, May 20, 2025 at 9:22 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> > Could you provide more details about your setup, and how you were able to repro it?
> >

Hi,
Were you able to reproduce the issue?

I’ve prepared a step-by-step guide that may help:
1. Set up a system with a Radeon 6900XT and an LG TV connected via HDMI.
2. Install Fedora Rawhide.
3. Build and install kernel 6.15-rc7 using my .config (attached in the
first message).
4. Boot into the custom-built kernel.
5. Set the display resolution to 3840×2160 @ 120 Hz.
(This step is optional but may help trigger the issue faster.)
6. Generate heavy system load. I use an infinite kernel rebuild loop:
<fish shell>
> for i in (seq 1 400000); make clean && make -j32 bzImage && make -j32 modules; end
</fish shell>

Expected behavior:
System remains stable during heavy load.

Actual behavior:
1. First, the kernel log is filled with repeated messages:
amdgpu 0000:03:00.0: amdgpu: [drm] DP AUX transfer fail:4
2. After a short while under load, more severe errors appear:
amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data:
DMCUB error - collecting diagnostic data
3. Finally, the system completely freezes with a hard lockup:
watchdog: CPU28: Watchdog detected hard LOCKUP on cpu 28
Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_queue
nfnetlink_queue nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep
sunrpc binfmt_misc amd_atl intel_rapl_msr intel_rapl_common
edac_mce_amd btusb snd_hda_codec_realtek btrtl mt7921e btintel
mt7921_common snd_hda_codec_generic btbcm mt792x_lib btmtk
snd_hda_scodec_component snd_hda_codec_hdmi snd_usb_audio
mt76_connac_lib kvm_amd snd_hda_intel bluetooth mt76 snd_intel_dspcfg
snd_usbmidi_lib snd_intel_sdw_acpi snd_hda_codec mc kvm spd5118
mac80211 snd_ump snd_hda_core snd_rawmidi snd_hwdep vfat irqbypass fat
snd_seq snd_seq_device wmi_bmof libarc4 rapl r8169 pcspkr snd_pcm
cfg80211 i2c_piix4 snd_timer k10temp i2c_smbus realtek snd rfkill
joydev soundcore gpio_amdpt gpio_generic loop nfnetlink zram
lz4hc_compress lz4_compress amdgpu amdxcp
 i2c_algo_bit drm_ttm_helper ttm drm_exec nvme polyval_clmulni
gpu_sched polyval_generic ghash_clmulni_intel drm_suballoc_helper
ucsi_ccg nvme_core sha512_ssse3 typec_ucsi drm_panel_backlight_quirks
sha256_ssse3 drm_buddy nvme_keyring typec sha1_ssse3 sp5100_tco
nvme_auth drm_display_helper cec video wmi fuse
irq event stamp: 117172
hardirqs last  enabled at (117171): [<ffffffff9e001566>]
asm_common_interrupt+0x26/0x40
hardirqs last disabled at (117172): [<ffffffffa1c00f97>]
irqentry_enter+0x57/0x60
softirqs last  enabled at (117144): [<ffffffff9e614919>]
handle_softirqs+0x579/0x840
softirqs last disabled at (117137): [<ffffffff9e614d16>]
__irq_exit_rcu+0x126/0x240
CPU: 28 UID: 1000 PID: 1737394 Comm: as Tainted: G        W    L
------  ---  6.15.0-0.rc6.250515g088d13246a46.54.fc43.x86_64+debug #1
PREEMPT(lazy)
Tainted: [W]=WARN, [L]=SOFTLOCKUP
Hardware name: ASRock B650I Lightning WiFi/B650I Lightning WiFi, BIOS
3.08 09/18/2024
RIP: 0010:delay_halt_mwaitx+0x20/0x50
Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 65
48 8b 05 56 13 0d 04 31 d2 48 89 d1 48 05 00 00 ca a5 0f 01 fa <b8> ff
ff ff ff b9 02 00 00 00 48 39 c6 48 0f 47 f0 b8 f0 00 00 00
RSP: 0000:ffffc9003b68f820 EFLAGS: 00000087
RAX: ffff888fda610000 RBX: 000000000000118c RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000118c RDI: 000023b4c02956f6
RBP: 000023b4c02956f6 R08: ffffffffc14b01a9 R09: fffffbfff49570d4
R10: 000000000000001c R11: 0000000000002000 R12: ffffed1040583d43
R13: ffffed1040583d17 R14: 00000000000186a0 R15: ffff888202c1e800
FS:  00007f1da07bcd00(0000) GS:ffff889034970000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1da0a43930 CR3: 00000003e594a000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 delay_halt.part.0+0x33/0x60
 dmub_srv_wait_for_idle+0x12f/0x1d0 [amdgpu]
 dc_dmub_srv_cmd_run_list+0x99/0x2a0 [amdgpu]
 dc_dmub_srv_drr_update_cmd+0x158/0x340 [amdgpu]
 ? __lock_acquire+0x40f/0x1160
 ? __pfx_dc_dmub_srv_drr_update_cmd+0x10/0x10 [amdgpu]
 ? lock_acquire.part.0+0xc8/0x270
 ? local_clock_noinstr+0xf/0x130
 optc1_set_drr+0x18b/0xf20 [amdgpu]
 ? rcu_is_watching+0x15/0xe0
 set_drr_and_clear_adjust_pending+0xa6/0x180 [amdgpu]
 ? __lock_acquire+0x40f/0x1160
 dcn10_set_drr+0x224/0x390 [amdgpu]
 ? __pfx_dcn10_set_drr+0x10/0x10 [amdgpu]
 ? local_clock+0x15/0x30
 ? __lock_release.isra.0+0x1cb/0x340
 ? rcu_is_watching+0x15/0xe0
 dc_stream_adjust_vmin_vmax+0x4d9/0xd60 [amdgpu]
 ? __pfx_dc_stream_adjust_vmin_vmax+0x10/0x10 [amdgpu]
 ? dm_crtc_high_irq+0x4c8/0xb70 [amdgpu]
 ? __raw_spin_lock_irqsave+0x60/0x90
 dm_crtc_high_irq+0x7b5/0xb70 [amdgpu]
 ? amdgpu_dm_irq_handler+0xf3/0x2a0 [amdgpu]
 amdgpu_dm_irq_handler+0x19a/0x2a0 [amdgpu]
 amdgpu_irq_dispatch+0x286/0x670 [amdgpu]
 ? find_held_lock+0x2b/0x80
 ? __pfx_amdgpu_irq_dispatch+0x10/0x10 [amdgpu]
 ? __pfx___drm_dev_dbg+0x10/0x10
 ? do_raw_spin_unlock+0x59/0x230
 ? __wake_up+0x44/0x60
 amdgpu_ih_process+0x1c4/0x3a0 [amdgpu]
 ? __pfx_amdgpu_irq_handler+0x10/0x10 [amdgpu]
 amdgpu_irq_handler+0x27/0xb0 [amdgpu]
 ? __pfx_amdgpu_irq_handler+0x10/0x10 [amdgpu]
 __handle_irq_event_percpu+0x1b5/0x510
 handle_irq_event+0xab/0x1c0
 handle_edge_irq+0x213/0xb50
 __common_interrupt+0xad/0x1d0
 ? irq_enter_rcu+0x26/0x190
 common_interrupt+0x5a/0xe0
 asm_common_interrupt+0x26/0x40
RIP: 0033:0x5639d9e740ee
Code: 45 c8 85 d2 74 04 41 80 08 04 48 83 c4 58 4c 89 c8 5b 41 5c 41
5d 41 5e 41 5f 5d c3 48 8b 57 10 44 8b 15 fd bd 08 00 4c 03 0a <45> 85
d2 0f 84 33 ff ff ff 83 c8 04 4c 89 4f 20 88 07 4c 89 c8 c3
RSP: 002b:00007ffce2334e38 EFLAGS: 00000202
RAX: 0000000000000001 RBX: 00007f1d91572388 RCX: 0000000000000002
RDX: 00007f1d90e9c750 RSI: 0000000000000000 RDI: 00007f1d912e5d20
RBP: 00007ffce2334e40 R08: 00007f1d912e5d20 R09: 000000000000e119
R10: 0000000000000001 R11: 0000000000000002 R12: 00007f1d9157f008
R13: 0000000000000000 R14: 00005639d9e74f90 R15: 00007f1da06fe730
 </TASK>
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 5.441 msecs
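
The three stages above can be told apart in a live kernel log with a small filter. The following is only a minimal sketch: the function names `classify_line` and `watch_kernel_log` and the stage labels are hypothetical, the match strings are copied from the messages quoted in this report, and reading the log with `dmesg --follow` may require root on some systems. Once the hard lockup hits, a local watcher may not get to print anything, so the last stage is mainly useful when scanning a saved log.

```shell
#!/bin/sh
# Hypothetical helper: map one kernel log line to a stage label.
# The match strings are taken from the messages quoted in this report.
classify_line() {
    case "$1" in
        *"DP AUX transfer fail"*)          echo "aux-fail" ;;
        *"DMCUB error"*)                   echo "dmcub-error" ;;
        *"Watchdog detected hard LOCKUP"*) echo "hard-lockup" ;;
        *)                                 echo "other" ;;
    esac
}

# Hypothetical helper: follow the kernel log and print only matching lines,
# prefixed with their stage label. It is defined here but not called.
watch_kernel_log() {
    dmesg --follow | while IFS= read -r line; do
        stage=$(classify_line "$line")
        if [ "$stage" != "other" ]; then
            printf '%s: %s\n' "$stage" "$line"
        fi
    done
}
```

Run `watch_kernel_log` in a second terminal while the rebuild loop is active. To check a saved log instead, feed each line of the file to `classify_line`.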

Environment:
GPU: AMD Radeon 6900XT
Display: LG TV via HDMI
Kernel: 6.15-rc7, built from source using provided config
Distro: Fedora Rawhide
Motherboard: ASRock B650I Lightning WiFi
BIOS: 3.08 (2024-09-18)

Additional diagnostic info:
Full kernel log ending with stack trace from delay_halt_mwaitx()
Series of dc_dmub_srv_drr_update_cmd() and
dc_stream_adjust_vmin_vmax() calls in call trace
System enters an unrecoverable locked-up state after a few minutes of heavy compilation

--
Best Regards,
Mike Gavrilov.
