SCU polling loops to prevent soft lockups

w15303746062 Thu, 04 Jun 2026 20:23:15 -0700


Hi Thomas and all,


A gentle ping on this patch.

It has been about three weeks since submission. 

This patch fixes a critical soft lockup issue we observed when the ASPEED 
hardware becomes unresponsive. 

Since it uses a safe timeout mechanism without altering the existing API 
signatures, it should be a straightforward and low-risk mitigation.

Could anyone please help review this when you have a moment, or let me know if 
any modifications are needed?

Best regards,
Mingyu

At 2026-05-13 19:39:49, [email protected] wrote:
>From: Mingyu Wang <[email protected]>
>
>While validating the driver using DevGen (a framework that synthesizes
>virtual device models directly from driver source code via LLM guidance),
>a severe soft lockup was observed.
>
>The hardware polling loops in `__ast_mindwm`, `__ast_moutdwm`, and
>`ast_2500_patch_ahb` lack a timeout mechanism. On bare-metal systems,
>if the ASPEED chip becomes unresponsive or a PCIe bus fault occurs,
>the CPU will spin indefinitely in these loops. This results in a system
>hang, triggering the watchdog soft lockup and causing subsequent I/O
>starvation (e.g., blocking jbd2).
>
>Fix this by introducing a bounded loop with a safe timeout of
>approximately 100ms using `udelay(10)`. Using `udelay()` ensures
>that the fix remains safe even if these accessors are called from
>an atomic context or while holding spinlocks. If the hardware fails
>to respond, the loop breaks and emits a `WARN_ONCE`, allowing the
>kernel to degrade gracefully and preventing complete system paralysis.
>
>Signed-off-by: Mingyu Wang <[email protected]>
>---
>Hi Thomas,
>
>Thanks for the prompt response and confirmation!
>
>Instead of just waiting for threshold suggestions, I have drafted this 
>patch to address the soft lockup. Since changing the return types of 
>`__ast_mindwm` and `__ast_moutdwm` to propagate error codes (e.g., 
>`-ETIMEDOUT`) would require an intrusive refactoring across the entire 
>AST driver, I took a more defensive, minimal-invasive approach.
>
>To avoid the risk of sleeping in an atomic context (in case these 
>low-level I/O accessors are ever called under a spinlock), I used 
>a bounded loop with `udelay(10)` and a maximum of 10000 iterations 
>(approx. 100ms total timeout). If the ASPEED hardware completely 
>fails to respond, it breaks the infinite loop, emits a `WARN_ONCE`, 
>and prevents the CPU from halting the entire system.
>
>Please let me know if you think the 100ms threshold and the `udelay` 
>approach are appropriate for this specific AHB/SCU hardware sequence.
>
> drivers/gpu/drm/ast/ast_2500.c |  8 +++++++-
> drivers/gpu/drm/ast/ast_post.c | 16 ++++++++++++++--
> 2 files changed, 21 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/gpu/drm/ast/ast_2500.c b/drivers/gpu/drm/ast/ast_2500.c
>index 2a52af0ded56..08d18f90201a 100644
>--- a/drivers/gpu/drm/ast/ast_2500.c
>+++ b/drivers/gpu/drm/ast/ast_2500.c
>@@ -107,6 +107,7 @@ static const u32 
>ast2500_ddr4_1600_timing_table[REGTBL_NUM] = {
> void ast_2500_patch_ahb(void __iomem *regs)
> {
>       u32 data;
>+      int retries = 10000; /* ~100ms timeout */
> 
>       /* Clear bus lock condition */
>       __ast_moutdwm(regs, 0x1e600000, 0xAEED1A03);
>@@ -136,7 +137,12 @@ void ast_2500_patch_ahb(void __iomem *regs)
>       do {
>               __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
>               data = __ast_mindwm(regs, 0x1e6e2000);
>-      } while (data != 1);
>+              if (data == 1)
>+                      break;
>+              udelay(10);
>+      } while (--retries);
>+
>+      WARN_ONCE(!retries, "ast: timeout waiting for AHB patch\n");
> 
>       __ast_moutdwm(regs, 0x1e6e207c, 0x08000000); /* clear fast reset */
> }
>diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
>index b72914dbed38..66eb80925e27 100644
>--- a/drivers/gpu/drm/ast/ast_post.c
>+++ b/drivers/gpu/drm/ast/ast_post.c
>@@ -37,13 +37,19 @@
> u32 __ast_mindwm(void __iomem *regs, u32 r)
> {
>       u32 data;
>+      int retries = 10000; /* ~100ms timeout */
> 
>       __ast_write32(regs, 0xf004, r & 0xffff0000);
>       __ast_write32(regs, 0xf000, 0x1);
> 
>       do {
>               data = __ast_read32(regs, 0xf004) & 0xffff0000;
>-      } while (data != (r & 0xffff0000));
>+              if (data == (r & 0xffff0000))
>+                      break;
>+              udelay(10);
>+      } while (--retries);
>+
>+      WARN_ONCE(!retries, "ast: timeout reading from AHB/SCU\n");
> 
>       return __ast_read32(regs, 0x10000 + (r & 0x0000ffff));
> }
>@@ -51,13 +57,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r)
> void __ast_moutdwm(void __iomem *regs, u32 r, u32 v)
> {
>       u32 data;
>+      int retries = 10000; /* ~100ms timeout */
> 
>       __ast_write32(regs, 0xf004, r & 0xffff0000);
>       __ast_write32(regs, 0xf000, 0x1);
> 
>       do {
>               data = __ast_read32(regs, 0xf004) & 0xffff0000;
>-      } while (data != (r & 0xffff0000));
>+              if (data == (r & 0xffff0000))
>+                      break;
>+              udelay(10);
>+      } while (--retries);
>+
>+      WARN_ONCE(!retries, "ast: timeout writing to AHB/SCU\n");
> 
>       __ast_write32(regs, 0x10000 + (r & 0x0000ffff), v);
> }
>-- 
>2.34.1

Re:[PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups

Reply via email to