On Mon, Jun 8, 2026 at 2:55 PM Akhil P Oommen <[email protected]> wrote: > > On 6/5/2026 12:20 PM, Rob Clark wrote: > > On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <[email protected]> > > wrote: > >> > >> From: Jie Zhang <[email protected]> > >> > >> Once a hang is triggered by the msm_recovery test, the gpu error irq > >> remains asserted and triggers an interrupt storm. In the worst case, > >> this IRQ storm lands on the CPU core where the hangcheck timer is > >> scheduled, blocking it from running. This eventually leads to CPU > >> watchdog timeouts. > >> > >> To fix this, mask the gpu error irqs during msm_recovery test and > >> enable them back during the recovery. > >> > >> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling") > >> Signed-off-by: Jie Zhang <[email protected]> > >> Signed-off-by: Akhil P Oommen <[email protected]> > >> --- > >> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++ > >> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++- > >> drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++- > >> drivers/gpu/drm/msm/msm_gpu.c | 2 ++ > >> 4 files changed, 15 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > >> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > >> index 2c0bbac43c52..f1df2514c613 100644 > >> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > >> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > >> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu) > >> status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR); > >> > >> if (priv->disable_err_irq) { > >> + /* Turn off interrupts to avoid interrupt storm */ > >> + gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK, > >> + A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > >> + A5XX_RBBM_INT_0_MASK_CP_SW); > >> + > >> status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > >> A5XX_RBBM_INT_0_MASK_CP_SW; > >> } > >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > >> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > >> index 8b3bb2fd433b..9a4f9d0e1780 100644 > >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > >> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu) > >> > >> gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status); > >> > >> - if (priv->disable_err_irq) > >> + if (priv->disable_err_irq) { > >> + /* Turn off interrupts to avoid interrupt storm */ > >> + gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, > >> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS); > >> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; > >> + } > >> > >> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) > >> a6xx_fault_detect_irq(gpu); > >> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c > >> b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c > >> index 9e44fd1ae634..0f6fd35bd587 100644 > >> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c > >> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c > >> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu) > >> > >> gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status); > >> > >> - if (priv->disable_err_irq) > >> + if (priv->disable_err_irq) { > >> + /* Turn off interrupts to avoid interrupt storm */ > >> + gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, > >> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS); > >> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; > >> + } > >> > >> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) > >> a8xx_fault_detect_irq(gpu); > >> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c > >> index 9ac7740a87f0..48ac51f4119b 100644 > >> --- a/drivers/gpu/drm/msm/msm_gpu.c > >> +++ b/drivers/gpu/drm/msm/msm_gpu.c > >> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work) > >> msm_update_fence(ring->fctx, fence); > >> } > >> > >> + priv->disable_err_irq = false; > > > > Ok, so we rely on recovery to re-enable the error irqs.. that is > > probably ok, given the intended purpose of the debugfs file. And, > > well, it is debugfs. But why do we clear disable_err_irq here? > > Now that we are updating the IRQ mask register which won't reset until > there is a gpu suspend, its side effect will be felt even after > userspace deasserts the debugfs knob, potentially into the next > testcase. This is different from the older behavior. So, I felt it would > be better to reset this flag during the recovery, considering > msm_recovery is the only user of this knob, afaiu.
Hmm... maybe debugfs writes should just immediately update the irq mask (if the gpu is powered)? BR, -R > I should have explicitly called out this new behavior of disable_err_irq > in the commit text, but I forgot. > > -Akhil. > > > > > BR, > > -R > > > >> + > >> gpu->funcs->recover(gpu); > >> > >> /* retire completed submits, plus the one that hung: */ > >> > >> -- > >> 2.51.0 > >> >
