On 6/9/2026 7:55 PM, Rob Clark wrote:
> On Tue, Jun 9, 2026 at 6:09 AM Akhil P Oommen <[email protected]> 
> wrote:
>>
>> On 6/9/2026 5:50 AM, Rob Clark wrote:
>>> On Mon, Jun 8, 2026 at 2:55 PM Akhil P Oommen <[email protected]> 
>>> wrote:
>>>>
>>>> On 6/5/2026 12:20 PM, Rob Clark wrote:
>>>>> On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <[email protected]> 
>>>>> wrote:
>>>>>>
>>>>>> From: Jie Zhang <[email protected]>
>>>>>>
>>>>>> Once a hang is triggered by the msm_recovery test, the gpu error irq
>>>>>> remains asserted and triggers an interrupt storm. In the worst case,
>>>>>> this IRQ storm lands on the CPU core where the hangcheck timer is
>>>>>> scheduled, blocking it from running. This eventually leads to CPU
>>>>>> watchdog timeouts.
>>>>>>
>>>>>> To fix this, mask the gpu error irqs during msm_recovery test and
>>>>>> enable them back during the recovery.
>>>>>>
>>>>>> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
>>>>>> Signed-off-by: Jie Zhang <[email protected]>
>>>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>>>> ---
>>>>>>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
>>>>>>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
>>>>>>  drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
>>>>>>  drivers/gpu/drm/msm/msm_gpu.c         | 2 ++
>>>>>>  4 files changed, 15 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c 
>>>>>> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>>>> index 2c0bbac43c52..f1df2514c613 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>>>> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>>>>>>                 status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>>>>>>
>>>>>>         if (priv->disable_err_irq) {
>>>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>>>> +               gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
>>>>>> +                              A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>>>>>> +                              A5XX_RBBM_INT_0_MASK_CP_SW);
>>>>>> +
>>>>>>                 status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>>>>>>                           A5XX_RBBM_INT_0_MASK_CP_SW;
>>>>>>         }
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>>>>>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> index 8b3bb2fd433b..9a4f9d0e1780 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>>>>>>
>>>>>>         gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>>>>>>
>>>>>> -       if (priv->disable_err_irq)
>>>>>> +       if (priv->disable_err_irq) {
>>>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>>>> +               gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 
>>>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>>>>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>>>>>> +       }
>>>>>>
>>>>>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>>>>>                 a6xx_fault_detect_irq(gpu);
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c 
>>>>>> b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>>>> index 9e44fd1ae634..0f6fd35bd587 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>>>> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
>>>>>>
>>>>>>         gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
>>>>>>
>>>>>> -       if (priv->disable_err_irq)
>>>>>> +       if (priv->disable_err_irq) {
>>>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>>>> +               gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, 
>>>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>>>>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>>>>>> +       }
>>>>>>
>>>>>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>>>>>                 a8xx_fault_detect_irq(gpu);
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c 
>>>>>> b/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> index 9ac7740a87f0..48ac51f4119b 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>>>>>>                 msm_update_fence(ring->fctx, fence);
>>>>>>         }
>>>>>>
>>>>>> +       priv->disable_err_irq = false;
>>>>>
>>>>> Ok, so we rely on recovery to re-enable the error irqs..  that is
>>>>> probably ok, given the intended purpose of the debugfs file.  And,
>>>>> well, it is debugfs.  But why do we clear disable_err_irq here?
>>>>
>>>> Now that we are updating the IRQ mask register which won't reset until
>>>> there is a gpu suspend, its side effect will be felt even after
>>>> userspace deasserts the debugfs knob, potentially into the next
>>>> testcase. This is different from the older behavior. So, I felt it would
>>>> be better to reset this flag during the recovery, considering
>>>> msm_recovery is the only user of this knob, afaiu.
>>>
>>> Hmm... maybe debugfs writes should just immediately update the irq
>>> mask (if the gpu is powered)?
>>
>> That needs some plumbing in adreno func table to program the register.
>> We can do that if you prefer that, but is it an overkill for this usecase?
> 
> Yeah, that is why I didn't suggest it earlier.  But stealth re-enable
> of err irqs makes the behavior harder to reason about, which I
> dislike.  I guess it works now because msm_recovery only does a single
> hang-test w/ hw error irq's disabled.  But from userspace PoV it seems
> natural to expect error irq's to remain disabled until writing debugfs
> again.
> 
> Since it is about debugfs and a single igt test, maybe it is just best
> to document that this is how it works.
> 
> Or, would it be reasonable just to update the irq mask in gpu->submit()?

IFPC complicates this a little bit. We need to vote OOB to keep the GX
domain up to program this register. And I am not sure if we have any
existing OOB bit that we can use for this purpose.

I feel this debugfs knob is not worth any additional complications in
the driver at the moment. We can consider improving this when more
usecases show up. For now, I feel it is better to just document the
modified behavior.

-Akhil

> 
> BR,
> -R
> 
> 
>> -Akhil
>>
>>>
>>> BR,
>>> -R
>>>
>>>> I should have explicitly called out this new behavior of disable_err_irq
>>>> in the commit text, but I forgot.
>>>>
>>>> -Akhil.
>>>>
>>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>>> +
>>>>>>         gpu->funcs->recover(gpu);
>>>>>>
>>>>>>         /* retire completed submits, plus the one that hung: */
>>>>>>
>>>>>> --
>>>>>> 2.51.0
>>>>>>
>>>>
>>

Reply via email to