On Wed, Jan 09, 2019 at 11:20:50PM +0100, Heiner Kallweit wrote:
> On 28.12.2018 07:39, Heiner Kallweit wrote:
> > On 28.12.2018 07:34, Heiner Kallweit wrote:
> >> On 28.12.2018 02:31, Frederic Weisbecker wrote:
> >>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
> >>>>
> >> [...]
> >>>
> >>> Interesting, the softirq is raised from hardirq but it's not handled in the end of
> >>> the IRQ. Are you running threaded IRQS by any chance? If so I would expect ksoftirqd
> >>> to handle the pending work before we go idle. However I can imagine a small window
> >>> where such an expectation may not be met: if the softirq is raised after the ksoftirqd
> >>> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we disable the CPU
> >>> (CPUHP_TEARDOWN_CPU).
> >>>
> >> I have a network driver (r8169) using NAPI which runs in softirq context AFAIK.
> >> For testing purposes I sometimes trigger system suspend via network, so there is
> >> network adapter activity when system suspends. Apart from that nothing really
> >> exciting:
> >>            CPU0       CPU1       CPU2       CPU3
> >>   0:         43          0          0          0   IO-APIC    2-edge      timer
> >>   1:          4          0          0          0   IO-APIC    1-edge      i8042
> >>   8:          0          1          0          0   IO-APIC    8-fasteoi   rtc0
> >>   9:          0          0          0          0   IO-APIC    9-fasteoi   acpi
> >>  12:          0          0          0          5   IO-APIC   12-edge      i8042
> >> 120:          0          0          0          0   PCI-MSI 311296-edge      PCIe PME
> >> 121:          0          0          0          0   PCI-MSI 315392-edge      PCIe PME
> >> 122:          0          0          0          0   PCI-MSI 327680-edge      PCIe PME
> >> 123:          0          0       3328          0   PCI-MSI 294912-edge      ahci[0000:00:12.0]
> >> 124:          0        133          0          0   PCI-MSI 344064-edge      xhci_hcd
> >> 125:          0          0         32          0   PCI-MSI 245760-edge      mei_me
> >> 127:        381          0          0          0   PCI-MSI 1572864-edge      enp3s0
> >> 128:          0          0          0        236   PCI-MSI 32768-edge      i915
> >> 129:          0        374          0          0   PCI-MSI 229376-edge      snd_hda_intel:card0
> >>
> >>> I don't know if we can afford to ignore a softirq even at this late stage. We should
> >>> probably avoid leaking any. So here is a possible fix, if you don't mind trying:
> >>>
> >> I tested your patch and at least in the first minutes of testing couldn't reproduce
> >> the issue any longer. I tested manual system suspend and the following script you
> >> sent when we started to analyze the issue.
> >>
> >
> > Also after some more time the issue didn't occur again. So it seems your analysis
> > was right and also the approach to fix it. Thanks!
> > Will let you know in case the issue should pop up again under special circumstances.
> >
> Frederic, so far this fix didn't appear in linux-next, are you going to
> submit it?

Yep, I'll cook up a proper changelog and let Thomas judge whether the change is worth it.

