On Sat, Jun 27, 2020 at 03:14:14PM -0700, Andy Lutomirski wrote:
>
> > On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
> >>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]>
> >>> wrote:
> >>>
> >>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> >>> (where "HH" is the hexadecimal softirq vector number) when one or more
> >>> non-RCU softirq handlers are still enabled when checking to stop the
> >>> scheduler-tick interrupt. This message is not as enlightening as one
> >>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> >>> local softirq work is pending, handler #HH".
> >>
> >> Thank you! It would be even better if it would explain *why* the
> >> problem happened, but I suppose this code doesn't actually know.
> >
> > Glad to help!
> >
> > To your point, is it possible to bisect the appearance of this message,
> > or is it as usual non-reproducible? (Hey, had to ask!)
> >
> >
>
> In this particular case, I tracked it down by good old-fashioned sleuthing
> for bugs, but it’s still unclear to me precisely how NOHZ gets involved. The
> bug is that we were entering the kernel from usermode, doing nmi_enter(),
> turning on interrupts, maybe getting a page fault, raising a signal, turning
> off interrupts, nmi_exit(), and back to usermode, with the signal still
> queued and undelivered. This is all kinds of bad, but I still don’t
> understand what softirqs or idle have to do with it.
>
> But I have the bug fixed now!

Glad you found it!

Thanx, Paul