Hi,
I've hit a reliably reproducible hard display hang on a Radeon 780M (RDNA3,
DCN3.1, Phoenix/Hawk Point) and tracked down what looks like the root cause,
with a small candidate fix below. Sending it here in case it's useful — I'm
happy to test patches on the hardware or submit a proper Signed-off-by patch
via git if you'd prefer.
== Summary ==
Re-plugging an external DisplayPort monitor that is powered but in standby
(HPD still asserted, but unresponsive on AUX) hard-hangs the display. The
DMUB wedges and floods the log with
"dc_dmub_srv_log_diagnostic_data: DMCUB error" forever; the compositor and
all outputs freeze (frozen mouse) while the rest of the system keeps
running. Only a power-cycle recovers it. If the same monitor is *awake* at
re-plug, link training succeeds and there is no hang, so the trigger is
specifically link training against a present-but-unresponsive (AUX-dead)
sink.
== Environment ==
- GPU: Radeon 780M, RDNA3 iGPU — HawkPoint1 [1002:1900] (rev d2), DCN3.1
- Kernel 7.0.10; reproduces on earlier kernels too (not a recent regression)
- linux-firmware current; DMUB fw version=0x08005B00
- cmdline: amdgpu.dc=1 amdgpu.dpm=1 amdgpu.dcdebugmask=0x10 (PSR disabled;
  no effect)
- Attach: native USB-C DP-alt-mode (also reproduces via a USB4/Thunderbolt
  DP tunnel)
- Monitor enters DP standby within ~5s of signal loss and does not wake
  over the link
== Steps to reproduce ==
1. External DP monitor connected and working.
2. Unplug it live (do NOT power it off). It enters standby within a few
   seconds.
3. Re-plug while it is in standby.
   -> On HPD, link training fails, the DMUB wedges, and the whole display
      hard-hangs.
Not reproducible if the monitor is awake at step 3.
== dmesg ==
WARNING: .../display/dc/link/protocols/link_dp_training.c:1597
at dp_perform_link_training+0x111/0x530 [amdgpu]
Workqueue: events_highpri dm_irq_work_func [amdgpu]
dp_verify_link_cap_with_retries+0x231/0x510 [amdgpu]
link_detect+0x478/0x590 [amdgpu]
handle_hpd_irq_helper+0x277/0x300 [amdgpu]
[drm] *ERROR* dpcd_set_link_settings: core_link_write_dpcd (DP_LINK_BW_SET) failed
[drm] *ERROR* dpcd_set_link_settings: core_link_write_dpcd (DP_LANE_COUNT_SET) failed
[drm] *ERROR* dpcd_set_link_settings: core_link_write_dpcd (DP_DOWNSPREAD_CTRL) failed
[drm] REG_WAIT timeout 1us * 100 tries - dcn31_program_compbuf_size line:141
WARNING: .../display/dc/hubbub/dcn31/dcn31_hubbub.c:151 dcn31_program_compbuf_size
[drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
(repeats ~4/s until power-off; it was the last line logged before a forced
reboot in two captured incidents)
== Root cause ==
On HPD, link_detect() -> dp_verify_link_cap_with_retries() repeatedly calls
dp_perform_link_training() across link settings. A standby sink still
asserts HPD, so it is never treated as LINK_TRAINING_ABORT (unplugged) and
the loop keeps retrying into it.
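To make the retry behaviour concrete, here is a small standalone C model of
it. This is only a sketch and not the amdgpu code: struct retry_sink,
retry_train(), retry_verify_link_cap() and the attempt limit are
hypothetical names made up for illustration. The only things taken from the
description above are that an abort happens solely when HPD is gone, and
that a failed attempt is retried.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's link-training results. */
enum retry_result { RETRY_LT_SUCCESS, RETRY_LT_FAIL, RETRY_LT_ABORT };

struct retry_sink {
	bool hpd_asserted;   /* sink is physically present */
	bool aux_responsive; /* sink answers AUX/DPCD transactions */
};

/* One training attempt: ABORT only when HPD is gone (unplugged). */
static enum retry_result retry_train(const struct retry_sink *sink)
{
	if (!sink->hpd_asserted)
		return RETRY_LT_ABORT;
	return sink->aux_responsive ? RETRY_LT_SUCCESS : RETRY_LT_FAIL;
}

/*
 * Retry until success, abort, or max_attempts is reached.
 * Returns the number of training attempts that were made.
 */
static int retry_verify_link_cap(const struct retry_sink *sink,
				 int max_attempts)
{
	int attempts = 0;

	while (attempts < max_attempts) {
		enum retry_result r = retry_train(sink);

		attempts++;
		if (r == RETRY_LT_SUCCESS || r == RETRY_LT_ABORT)
			break;
	}
	return attempts;
}
```

In this model a sink with hpd_asserted = true and aux_responsive = false
uses up every attempt, while an awake or unplugged sink stops after the
first one.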
Each attempt runs dp_perform_8b_10b_link_training(), which calls
dpcd_set_link_settings() to write DP_DOWNSPREAD_CTRL / DP_LANE_COUNT_SET /
DP_LINK_BW_SET. Against the standby sink every core_link_write_dpcd()
returns != DC_OK. But:
- dpcd_set_link_settings() logs each failure and continues, returning only
  the last status;
- dp_perform_8b_10b_link_training() discards that return value and proceeds
  to the clock-recovery / channel-EQ sequences, programming hardware
  (dcn31_program_compbuf_size REG_WAIT timeouts) into a link the sink has
  already failed to acknowledge, which wedges the DMUB.
There is no DPCD sink-presence check before hardware programming.
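The difference between "log and continue, return the last status" and
"return on the first failure" can be shown with a standalone model. Again
this is a sketch under assumptions, not driver code: struct dpcd_model_aux,
dpcd_model_write() and both dpcd_model_set_*() functions are hypothetical,
and the three writes simply stand in for the three link-setting DPCD writes
named above.

```c
#include <stdbool.h>

enum dpcd_model_status { DPCD_MODEL_OK = 0, DPCD_MODEL_ERROR = 1 };

/* Number of link-setting writes modelled (spread, lane count, link rate). */
#define DPCD_MODEL_NUM_WRITES 3

struct dpcd_model_aux {
	bool responsive;   /* does the sink ACK AUX writes? */
	int writes_issued; /* how many AUX writes were attempted */
};

/* Hypothetical AUX write: counts the attempt, fails on a dead sink. */
static enum dpcd_model_status dpcd_model_write(struct dpcd_model_aux *aux)
{
	aux->writes_issued++;
	return aux->responsive ? DPCD_MODEL_OK : DPCD_MODEL_ERROR;
}

/* Behaviour as described above: keep going, return only the last status. */
static enum dpcd_model_status
dpcd_model_set_continue(struct dpcd_model_aux *aux)
{
	enum dpcd_model_status status = DPCD_MODEL_OK;

	for (int i = 0; i < DPCD_MODEL_NUM_WRITES; i++)
		status = dpcd_model_write(aux);
	return status;
}

/* Alternative: stop at the first write the sink does not acknowledge. */
static enum dpcd_model_status
dpcd_model_set_early_return(struct dpcd_model_aux *aux)
{
	for (int i = 0; i < DPCD_MODEL_NUM_WRITES; i++) {
		enum dpcd_model_status status = dpcd_model_write(aux);

		if (status != DPCD_MODEL_OK)
			return status;
	}
	return DPCD_MODEL_OK;
}
```

Both variants report an error for a dead sink; the early-return variant
just issues one AUX write instead of three. For a responsive sink they
behave identically.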
== Candidate fix ==
Fail link training gracefully when the sink does not ACK the basic
link-setting AUX writes: return early from dpcd_set_link_settings() on the
first failed write, and abort dp_perform_8b_10b_link_training() with
LINK_TRAINING_ABORT before programming hardware when those writes failed.
This only affects a fully AUX-unresponsive sink; a healthy (or
marginal-but-answering) sink returns DC_OK and trains as before, preserving
the existing retry/fallback behaviour.
(Diff below is illustrative; a mail client may reflow its whitespace. I can
send a clean git-am-able patch on request.)
--- a/.../dc/link/protocols/link_dp_training.c
+++ b/.../dc/link/protocols/link_dp_training.c
 	status = core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
 			&downspread.raw, sizeof(downspread));
-	if (status != DC_OK)
+	if (status != DC_OK) {
 		DC_LOG_ERROR("...core_link_write_dpcd (DP_DOWNSPREAD_CTRL) failed\n", ...);
+		/* First AUX transaction of link training. If it fails the sink is
+		 * unresponsive (e.g. powered but asleep); bail before issuing the
+		 * remaining writes / letting the caller program hardware into a
+		 * dead link (DMUB wedge on DCN3.1). */
+		return status;
+	}
--- a/.../dc/link/protocols/link_dp_training_8b_10b.c
+++ b/.../dc/link/protocols/link_dp_training_8b_10b.c
 	/* 1. set link rate, lane count and spread. */
 	if (lt_settings->lttpr_early_tps2)
 		set_link_settings_and_perform_early_tps2_retimer_pre_lt_sequence(...);
-	else
-		dpcd_set_link_settings(link, lt_settings);
+	else if (dpcd_set_link_settings(link, lt_settings) != DC_OK)
+		/* Sink did not ACK the basic link-setting AUX writes (powered but
+		 * asleep, still asserting HPD). Abort before programming the training
+		 * sequence; on DCN3.1 proceeding wedges the DMUB and hard-hangs the
+		 * display. */
+		return LINK_TRAINING_ABORT;
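For reviewers who want the intended control flow without the driver
context, the guard can be modelled in standalone C. This is a sketch under
stated assumptions: struct guard_link, guard_set_link_settings(),
guard_program_hw() and the two guard_train_*() functions are hypothetical
stand-ins, and "programming hardware" is reduced to a counter. It says
nothing about whether the real hardware path is the actual cause of the
wedge.

```c
#include <stdbool.h>

enum guard_result { GUARD_LT_SUCCESS, GUARD_LT_FAIL, GUARD_LT_ABORT };

struct guard_link {
	bool sink_acks_aux;   /* sink acknowledges link-setting AUX writes */
	int hw_program_calls; /* times the "hardware" was programmed */
};

/* Hypothetical link-setting step: true when the sink ACKed the writes. */
static bool guard_set_link_settings(const struct guard_link *link)
{
	return link->sink_acks_aux;
}

/* Hypothetical stand-in for the clock-recovery / channel-EQ programming. */
static void guard_program_hw(struct guard_link *link)
{
	link->hw_program_calls++;
}

/* Unpatched flow: the result of the link-setting step is discarded. */
static enum guard_result guard_train_unpatched(struct guard_link *link)
{
	(void)guard_set_link_settings(link);
	guard_program_hw(link);
	return link->sink_acks_aux ? GUARD_LT_SUCCESS : GUARD_LT_FAIL;
}

/* Patched flow: abort before touching hardware if the sink did not ACK. */
static enum guard_result guard_train_patched(struct guard_link *link)
{
	if (!guard_set_link_settings(link))
		return GUARD_LT_ABORT;
	guard_program_hw(link);
	return GUARD_LT_SUCCESS;
}
```

With sink_acks_aux = false, the unpatched model still programs the hardware
once, whereas the patched model returns the abort result with
hw_program_calls left at zero. With sink_acks_aux = true both paths program
the hardware and succeed.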
== Open question ==
On DCN3.1 AUX is DMUB-mediated. Is the wedge caused by the hardware
programming into the dead link (what this patch prevents) or by the failed
AUX-over-DMUB transactions themselves? The trace suggests the former: the
AUX writes return failure cleanly and the wedge correlates with the
subsequent dcn31_program_compbuf_size programming.
I can build/test on the affected hardware and collect DMUB diagnostics
before/after. Thanks for taking a look.
Greg