On 6/9/2026 5:50 AM, Rob Clark wrote:
> On Mon, Jun 8, 2026 at 2:55 PM Akhil P Oommen <[email protected]> 
> wrote:
>>
>> On 6/5/2026 12:20 PM, Rob Clark wrote:
>>> On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <[email protected]> 
>>> wrote:
>>>>
>>>> From: Jie Zhang <[email protected]>
>>>>
>>>> Once a hang is triggered by the msm_recovery test, the gpu error irq
>>>> remains asserted and triggers an interrupt storm. In the worst case,
>>>> this IRQ storm lands on the CPU core where the hangcheck timer is
>>>> scheduled, blocking it from running. This eventually leads to CPU
>>>> watchdog timeouts.
>>>>
>>>> To fix this, mask the gpu error irqs during msm_recovery test and
>>>> enable them back during the recovery.
>>>>
>>>> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
>>>> Signed-off-by: Jie Zhang <[email protected]>
>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
>>>>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
>>>>  drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
>>>>  drivers/gpu/drm/msm/msm_gpu.c         | 2 ++
>>>>  4 files changed, 15 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c 
>>>> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>> index 2c0bbac43c52..f1df2514c613 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>>>> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>>>>                 status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>>>>
>>>>         if (priv->disable_err_irq) {
>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>> +               gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
>>>> +                              A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>>>> +                              A5XX_RBBM_INT_0_MASK_CP_SW);
>>>> +
>>>>                 status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>>>>                           A5XX_RBBM_INT_0_MASK_CP_SW;
>>>>         }
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>>>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> index 8b3bb2fd433b..9a4f9d0e1780 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>>>>
>>>>         gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>>>>
>>>> -       if (priv->disable_err_irq)
>>>> +       if (priv->disable_err_irq) {
>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>> +               gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 
>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>>>> +       }
>>>>
>>>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>>>                 a6xx_fault_detect_irq(gpu);
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c 
>>>> b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>> index 9e44fd1ae634..0f6fd35bd587 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>>>> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
>>>>
>>>>         gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
>>>>
>>>> -       if (priv->disable_err_irq)
>>>> +       if (priv->disable_err_irq) {
>>>> +               /* Turn off interrupts to avoid interrupt storm */
>>>> +               gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, 
>>>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>>>> +       }
>>>>
>>>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>>>                 a8xx_fault_detect_irq(gpu);
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>>>> index 9ac7740a87f0..48ac51f4119b 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>>>>                 msm_update_fence(ring->fctx, fence);
>>>>         }
>>>>
>>>> +       priv->disable_err_irq = false;
>>>
>>> Ok, so we rely on recovery to re-enable the error irqs..  that is
>>> probably ok, given the intended purpose of the debugfs file.  And,
>>> well, it is debugfs.  But why do we clear disable_err_irq here?
>>
>> Now that we are updating the IRQ mask register which won't reset until
>> there is a gpu suspend, its side effect will be felt even after
>> userspace deasserts the debugfs knob, potentially into the next
>> testcase. This is different from the older behavior. So, I felt it would
>> be better to reset this flag during the recovery, considering
>> msm_recovery is the only user of this knob, afaiu.
> 
> Hmm... maybe debugfs writes should just immediately update the irq
> mask (if the gpu is powered)?

That needs some plumbing in adreno func table to program the register.
We can do that if you prefer that, but is it an overkill for this usecase?

-Akhil

> 
> BR,
> -R
> 
>> I should have explicitly called out this new behavior of disable_err_irq
>> in the commit text, but I forgot.
>>
>> -Akhil.
>>
>>>
>>> BR,
>>> -R
>>>
>>>> +
>>>>         gpu->funcs->recover(gpu);
>>>>
>>>>         /* retire completed submits, plus the one that hung: */
>>>>
>>>> --
>>>> 2.51.0
>>>>
>>

Reply via email to