On 6/9/2026 7:55 PM, Rob Clark wrote: > On Tue, Jun 9, 2026 at 6:09 AM Akhil P Oommen <[email protected]> > wrote: >> >> On 6/9/2026 5:50 AM, Rob Clark wrote: >>> On Mon, Jun 8, 2026 at 2:55 PM Akhil P Oommen <[email protected]> >>> wrote: >>>> >>>> On 6/5/2026 12:20 PM, Rob Clark wrote: >>>>> On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <[email protected]> >>>>> wrote: >>>>>> >>>>>> From: Jie Zhang <[email protected]> >>>>>> >>>>>> Once a hang is triggered by the msm_recovery test, the gpu error irq >>>>>> remains asserted and triggers an interrupt storm. In the worst case, >>>>>> this IRQ storm lands on the CPU core where the hangcheck timer is >>>>>> scheduled, blocking it from running. This eventually leads to CPU >>>>>> watchdog timeouts. >>>>>> >>>>>> To fix this, mask the gpu error irqs during msm_recovery test and >>>>>> enable them back during the recovery. >>>>>> >>>>>> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling") >>>>>> Signed-off-by: Jie Zhang <[email protected]> >>>>>> Signed-off-by: Akhil P Oommen <[email protected]> >>>>>> --- >>>>>> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++ >>>>>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++- >>>>>> drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++- >>>>>> drivers/gpu/drm/msm/msm_gpu.c | 2 ++ >>>>>> 4 files changed, 15 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c >>>>>> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c >>>>>> index 2c0bbac43c52..f1df2514c613 100644 >>>>>> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c >>>>>> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c >>>>>> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu) >>>>>> status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR); >>>>>> >>>>>> if (priv->disable_err_irq) { >>>>>> + /* Turn off interrupts to avoid interrupt storm */ >>>>>> + gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK, >>>>>> + A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | >>>>>> + A5XX_RBBM_INT_0_MASK_CP_SW); >>>>>> + >>>>>> status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | >>>>>> A5XX_RBBM_INT_0_MASK_CP_SW; >>>>>> } >>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c >>>>>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c >>>>>> index 8b3bb2fd433b..9a4f9d0e1780 100644 >>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c >>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c >>>>>> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu) >>>>>> >>>>>> gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status); >>>>>> >>>>>> - if (priv->disable_err_irq) >>>>>> + if (priv->disable_err_irq) { >>>>>> + /* Turn off interrupts to avoid interrupt storm */ >>>>>> + gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, >>>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS); >>>>>> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; >>>>>> + } >>>>>> >>>>>> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) >>>>>> a6xx_fault_detect_irq(gpu); >>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c >>>>>> b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c >>>>>> index 9e44fd1ae634..0f6fd35bd587 100644 >>>>>> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c >>>>>> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c >>>>>> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu) >>>>>> >>>>>> gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status); >>>>>> >>>>>> - if (priv->disable_err_irq) >>>>>> + if (priv->disable_err_irq) { >>>>>> + /* Turn off interrupts to avoid interrupt storm */ >>>>>> + gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, >>>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS); >>>>>> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; >>>>>> + } >>>>>> >>>>>> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) >>>>>> a8xx_fault_detect_irq(gpu); >>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c >>>>>> b/drivers/gpu/drm/msm/msm_gpu.c >>>>>> index 9ac7740a87f0..48ac51f4119b 100644 >>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c >>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c >>>>>> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work) >>>>>> msm_update_fence(ring->fctx, fence); >>>>>> } >>>>>> >>>>>> + priv->disable_err_irq = false; >>>>> >>>>> Ok, so we rely on recovery to re-enable the error irqs.. that is >>>>> probably ok, given the intended purpose of the debugfs file. And, >>>>> well, it is debugfs. But why do we clear disable_err_irq here? >>>> >>>> Now that we are updating the IRQ mask register which won't reset until >>>> there is a gpu suspend, its side effect will be felt even after >>>> userspace deasserts the debugfs knob, potentially into the next >>>> testcase. This is different from the older behavior. So, I felt it would >>>> be better to reset this flag during the recovery, considering >>>> msm_recovery is the only user of this knob, afaiu. >>> >>> Hmm... maybe debugfs writes should just immediately update the irq >>> mask (if the gpu is powered)? >> >> That needs some plumbing in adreno func table to program the register. >> We can do that if you prefer that, but is it an overkill for this usecase? > > Yeah, that is why I didn't suggest it earlier. But stealth re-enable > of err irqs makes the behavior harder to reason about, which I > dislike. I guess it works now because msm_recovery only does a single > hang-test w/ hw error irq's disabled. But from userspace PoV it seems > natural to expect error irq's to remain disabled until writing debugfs > again. > > Since it is about debugfs and a single igt test, maybe it is just best > to document that this is how it works. > > Or, would it be reasonable just to update the irq mask in gpu->submit()?
IFPC complicates this a little bit. We need to vote OOB to keep the GX domain up to program this register. And I am not sure if we have any existing OOB bit that we can use for this purpose. I feel this debugfs knob is not worth any additional complications in the driver at the moment. We can consider improving this when more usecases show up. For now, I feel it is better to just document the modified behavior. -Akhil > > BR, > -R > > >> -Akhil >> >>> >>> BR, >>> -R >>> >>>> I should have explicitly called out this new behavior of disable_err_irq >>>> in the commit text, but I forgot. >>>> >>>> -Akhil. >>>> >>>>> >>>>> BR, >>>>> -R >>>>> >>>>>> + >>>>>> gpu->funcs->recover(gpu); >>>>>> >>>>>> /* retire completed submits, plus the one that hung: */ >>>>>> >>>>>> -- >>>>>> 2.51.0 >>>>>> >>>> >>
