On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
>
> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> CPUs from the list of CPUs targeted by these IPIs, they may not have to
> execute
> the callbacks immediately. Anything that only affects kernelspace can wait
> until the next user->kernel transition, providing it can be executed "early
> enough" in the entry code.
>
I want to point out that there's another option here, although anyone trying to
implement it would be fighting against quite a lot of history.
Logically, each CPU is in one of a handful of states: user mode, idle, normal
kernel mode (possibly subdivided into IRQ, etc), and a handful of very narrow
windows, hopefully uninstrumented and not accessing any PTEs that might be
invalid, in the entry and exit paths where any state in memory could be out of
sync with actual CPU state. (The latter includes right after the CPU switches
to kernel mode, for example.) And NMI and MCE and whatever weird "security"
entry types that Intel and AMD love to add.
The way the kernel *currently* deals with this has two big historical oddities:
1. The entry and exit code cares about ti_flags, which is per-*task*, which
means that atomically poking it from other CPUs involves the runqueue lock or
other shenanigans (see the idle nr_polling code for example), and also that
it's not accessible from the user page tables if PTI is on.
2. The actual heavyweight atomic part (context tracking) was built for RCU, and
it's sort of bolted on, and, as you've observed in this series, it's really
quite awkward to do things that aren't RCU using context tracking.
If this were a greenfield project, I think there's a straightforward approach
that's much nicer: stick everything into a single percpu flags structure.
Imagine we have cpu_flags, which tracks both the current state of the CPU and
what work needs to be done on state changes. On exit to user mode, we would
atomically set the mode to USER and make sure we don't touch anything like
vmalloc space after that. On entry back to kernel mode, we would avoid vmalloc
space, etc, then atomically switch to kernel mode and read out whatever
deferred work is needed. As an optimization, if nothing in the current
configuration needs atomic state tracking, the state could be left at
USER_OR_KERNEL and the overhead of an extra atomic op at entry and exit could
be avoided.
And RCU would hook into *that* instead of having its own separate set of hooks.
I think that actually doing this would be a big improvement and would also be a
serious project. There's a lot of code that would get touched, and the
existing context tracking code is subtle and confusing. And, as mentioned,
ti_flags has the wrong scope.
It's *possible* that one could avoid making ti_flags percpu either by extensive
use of the runqueue locks or by borrowing a kludge from the idle code. For the
latter, right now, the reason that the wake-from-idle code works is that the
optimized path only happens if the idle thread/cpu is "polling", and it's
impossible for the idle ti_flags to be polling while the CPU isn't actually
idle. We could similarly observe that, if a task's ti_flags says it's in USER
mode *and* says it's on, say, CPU 3, then CPU 3 is most definitely in USER
mode. So someone could try shoving the CPU number into ti_flags :-p (Here
USER means actually in user mode, or in the late-exit / early-entry path.)
Anyway, benefits of this whole approach would include considerably (IMO)
increased comprehensibility compared to the current tangled ct code and much
more straightforward addition of new things that happen to a target CPU
conditionally depending on its mode. And, if the flags word was actually per
cpu, it could be mapped such that SWITCH_TO_KERNEL_CR3 would use it -- there
could be a single CR3 write (and maybe CR4/invpcid depending on whether a
zapped mapping is global) and the flush bit could depend on whether a flush is
needed. And there would be basically no chance that a bug that accessed
invalidated-but-not-flushed kernel data could be undetected -- in PTI mode, any
such access would page fault! Similarly, if kernel text pokes deferred the
flush and serialization, the only code that could execute before noticing the
deferred flush would be the user-CR3 code.
Oh, and another primitive would be possible: one CPU could plausibly execute
another CPU's interrupts or soft-irqs or whatever by taking a special lock that
would effectively pin the remote CPU in user mode -- you'd set a flag in the
target cpu_flags saying "pin in USER mode" and the transition on that CPU to
kernel mode would then spin on entry to kernel mode and wait for the lock to be
released. This could plausibly get a lot of the on_each_cpu callers to switch
over in one fell swoop: anything that needs to synchronize to the remote CPU
but does not need to poke its actual architectural state could be executed
locally while the remote CPU is pinned.
--Andy