On 2025-08-20 6:05 p.m., Andi Shyti wrote:
Hi Zhanjun,

...

On Tue, Aug 19, 2025 at 12:00:10PM -0400, Zhanjun Dong wrote:
Boolean flag access from interrupt context might have synchronous issue on

/synchronous/synchronization/
/issue/issues/

multiple processor platform, flage modified by one core might be read as an

/flage/flags/

old value by another core. This issue on interrupt enable flag might causes
interrupt missing or leakage.

/missing/misses/

Fixed by change the data type as automic to add memory synchronization.

This can be re-written: "Change the interrupts.enable type to
atomic to ensure memory synchronization."
Will follow all above comments

No need to resend for this if the patch is fine.

Fixes: a187f13d51fa0 ("drm/i915/guc: handle interrupts from media GuC")

What issue are you exactly trying to fix? And have you tested
Will add:
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14834

that this patch is this really fixing it? Where is the flag's
CI report this issue about every 1 or 2 months and close it because no
reproduce.
I can reproduce it in about 1000 rounds(about 9 hours) of IGT test but not
at 100% rate, so it is hard to say really fixed because of hard to
reproduce.
My latest test runs 2200 rounds in total has no reproduce.

misalignment happening?

Please remember that when in interrupt context you are already
running in atomic, so that probably you don't need to have an
additional atomic access to variables.
Interrupt context only read it. When the flag was changed and interrupt was
triggered in very short time, the flag read at interrupt context might read
out old value, if other core(non interrupt context) set the flag and ISR
read out 0, ISR will do simple return and misses interrupt handling, making
it appear as lost interrupt; if other core clear the flag and ISR read out
1, ISR will continue to run and not stop as expected, making it appear as an
interrupt leak. >

Heh... it looks to me very unlikely to happen, but if you say
that this fixes it, then I'm OK with the patch.

I see still one case missing: the error comes here:

   ct_try_receive_message+0x336/0xa10

that is the tasklet that starts after the IRQ. Has the flag
changed from the irq to the tasklet? :-)

Would it make sense to add something like:

@@ -1338,6 +1338,9 @@ static void ct_receive_tasklet_func(struct tasklet_struct 
*t)
  {
         struct intel_guc_ct *ct = from_tasklet(ct, t, receive_tasklet);

+       if (!atomic_read(&guc->interrupts.enabled))
+               return;
+
         ct_try_receive_message(ct);
  }
Yes, I did test with this check, result looks the same as not add.
While, consider if the system is under heavy load, tasklet might get effected as well, so I will add it in next rev.

BTW, have you tried adding a simple memory barrier (even though
in the i915 community people dislike memory barriers, but with a
proper explanation it might save us all this churn).
Yes, will add it.

New rev to be posted after another 9 hours test.

Regards,
Zhanjun Dong.

Thanks,
Andi

Reply via email to