Hi Zhanjun, ...
> > On Tue, Aug 19, 2025 at 12:00:10PM -0400, Zhanjun Dong wrote: > > > Boolean flag access from interrupt context might have synchronous issue on > > > > /synchronous/synchronization/ > > /issue/issues/ > > > > > multiple processor platform, flage modified by one core might be read as > > > an > > > > /flage/flags/ > > > > > old value by another core. This issue on interrupt enable flag might > > > causes > > > interrupt missing or leakage. > > > > /missing/misses/ > > > > > Fixed by change the data type as automic to add memory synchronization. > > > > This can be re-written: "Change the interrupts.enable type to > > atomic to ensure memory synchronization." > Will follow all above comments No need to resend for this if the patch is fine. > > > Fixes: a187f13d51fa0 ("drm/i915/guc: handle interrupts from media GuC") > > > > What issue are you exactly trying to fix? And have you tested > Will add: > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14834 > > > that this patch is this really fixing it? Where is the flag's > CI report this issue about every 1 or 2 months and close it because no > reproduce. > I can reproduce it in about 1000 rounds(about 9 hours) of IGT test but not > at 100% rate, so it is hard to say really fixed because of hard to > reproduce. > My latest test runs 2200 rounds in total has no reproduce. > > > misalignment happening? > > > > Please remember that when in interrupt context you are already > > running in atomic, so that probably you don't need to have an > > additional atomic access to variables. > Interrupt context only read it. When the flag was changed and interrupt was > triggered in very short time, the flag read at interrupt context might read > out old value, if other core(non interrupt context) set the flag and ISR > read out 0, ISR will do simple return and misses interrupt handling, making > it appear as lost interrupt; if other core clear the flag and ISR read out > 1, ISR will continue to run and not stop as expected, making it appear as an > interrupt leak. > Heh... it looks to me very unlikely to happen, but if you say that this fixes it, then I'm OK with the patch. I see still one case missing: the error comes here: ct_try_receive_message+0x336/0xa10 that is the tasklet that starts after the IRQ. Has the flag changed from the irq to the tasklet? :-) Would it make sense to add something like: @@ -1338,6 +1338,9 @@ static void ct_receive_tasklet_func(struct tasklet_struct *t) { struct intel_guc_ct *ct = from_tasklet(ct, t, receive_tasklet); + if (!atomic_read(&guc->interrupts.enabled)) + return; + ct_try_receive_message(ct); } BTW, have you tried adding a simple memory barrier (even though in the i915 community people dislike memory barriers, but with a proper explanation it might save us all this churn). Thanks, Andi