On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
>
> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> CPUs from the list of CPUs targeted by these IPIs, they may not have to
> execute
> the callbacks immediately. Anything that only affects kernelspace can wait
> until the next user->kernel transition, providing it can be executed "early
> enough" in the entry code.
>
I want to point out that there's another option here, although anyone trying to
implement it would be fighting against quite a lot of history.
Logically, each CPU is in one of a handful of states: user mode, idle, normal
kernel mode (possibly subdivided into IRQ, etc), and a handful of very narrow
windows, hopefully uninstrumented and not accessing any PTEs that might be
invalid, in the entry and exit paths where any state in memory could be out of
sync with actual CPU state. (The latter includes right after the CPU switches
to kernel mode, for example.) And NMI and MCE and whatever weird "security"
entry types that Intel and AMD love to add.
The way the kernel *currently* deals with this has two big historical oddities:
1. The entry and exit code cares about ti_flags, which is per-*task*, which
means that atomically poking it from other CPUs involves the runqueue lock or
other shenanigans (see the idle nr_polling code for example), and also that
it's not accessible from the user page tables if PTI is on.
2. The actual heavyweight atomic part (context tracking) was built for RCU, and
it's sort of bolted on, and, as you've observed in this series, it's really
quite awkward to do things that aren't RCU using context tracking.
If this were a greenfield project, I think there's a straightforward approach
that's much nicer: stick everything into a single percpu flags structure.
Imagine we have cpu_flags, which tracks both the current state of the CPU and
what work needs to be done on state changes. On exit to user mode, we would
atomically set the mode to USER and make sure we don't touch anything like
vmalloc space after that. On entry back to kernel mode, we would avoid vmalloc
space, etc, then atomically switch to kernel mode and read out whatever
deferred work is needed. As an optimization, if nothing in the current
configuration needs atomic state tracking, the state could be left at
USER_OR_KERNEL and the overhead of an extra atomic op at entry and exit could
be avoided.
And RCU would hook into *that* instead of having its own separate set of hooks.
I think that actually doing this would be a big improvement and would also be a
serious project. There's a lot of code that would get touched, and the
existing context tracking code is subtle and confusing. And, as mentioned,
ti_flags has the wrong scope.
It's *possible* that one could avoid making ti_flags percpu either by extensive
use of the runqueue locks or by borrowing a kludge from the idle code. For the
latter, right now, the reason that the wake-from-idle code works is that the
optimized path only happens if the idle thread/cpu is "polling", and it's
impossible for the idle ti_flags to be polling while the CPU isn't actually
idle. We could similarly observe that, if a task's ti_flags says it's in USER
mode *and* says it's on, say, CPU 3, then CPU 3 is most definitely in USER
mode. So someone could try shoving the CPU number into ti_flags :-p (Here
USER means actually in user mode, or in the late-exit / early-entry path.)
Anyway, benefits of this whole approach would include considerably (IMO)
increased comprehensibility compared to the current tangled ct code and much
more straightforward addition of new things that happen to a target CPU
conditionally depending on its mode. And, if the flags word was actually per
cpu, it could be mapped such that SWITCH_TO_KERNEL_CR3 would use it -- there
could be a single CR3 write (and maybe CR4/invpcid depending on whether a
zapped mapping is global) and the flush bit could depend on whether a flush is
needed. And there would be basically no chance that a bug that accessed
invalidated-but-not-flushed kernel data could be undetected -- in PTI mode, any
such access would page fault! Similarly, if kernel text pokes deferred the
flush and serialization, the only code that could execute before noticing the
deferred flush would be the user-CR3 code.
Oh, and another primitive would be possible: one CPU could plausibly execute
another CPU's interrupts or soft-irqs or whatever by taking a special lock that
would effectively pin the remote CPU in user mode -- you'd set a flag in the
target cpu_flags saying "pin in USER mode" and the transition on that CPU to
kernel mode would then spin on entry to kernel mode and wait for the lock to be
released. This could plausibly get a lot of the on_each_cpu callers to switch
over in one fell swoop: anything that needs to synchronize to the remote CPU
but does not need to poke its actual architectural state could be executed
locally while the remote CPU is pinned.
--Andy