Hi,

I'm currently working on making nohz_full/nohz_idle toggleable at runtime
and some other people seem to be interested as well. So I've dumped
a few thoughts about some prerequisites to achieve that, for those
interested.

As you can see, there is a bit of hard work in the way. I'm iterating
on this at https://pad.kernel.org/p/isolation, feel free to edit:


== RCU nocb ==

Currently controllable with the "rcu_nocbs=" boot parameter and/or through
nohz_full=/isolcpus=nohz.
We need to make it toggleable at runtime. Work in progress:
v1: https://lwn.net/Articles/820544/
v2: coming soon

== TIF_NOHZ ==

We need to get rid of this flag so that CPUs that don't want nohz_full
don't take the syscall slow path. We also don't want to iterate over all
threads to clear the flag when the last nohz_full CPU exits nohz_full
mode. Instead, archs should use static keys to decide whether to call
context tracking. x86 already does that well.
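To illustrate, here is a rough userspace sketch (hypothetical names; a
plain bool stands in for a kernel static key / jump label) of how a
static branch lets the user-entry path skip context tracking entirely
when no CPU is in nohz_full mode, with nothing per-thread to set or
clear:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a static key: in the kernel, static_branch_unlikely()
 * patches the branch out of the hot path when the key is disabled. */
static bool context_tracking_key;

static int user_enter_calls;

static void context_tracking_user_enter(void)
{
	user_enter_calls++;	/* would update the per-CPU tracking state */
}

/* Called on every return to user space (syscall/exception exit). */
static void enter_user_mode(void)
{
	if (context_tracking_key)	/* static_branch_unlikely() in the kernel */
		context_tracking_user_enter();
}
```

With the key disabled the call is skipped everywhere at once; enabling
it when the first CPU enters nohz_full mode turns tracking on globally,
without walking every task to flip a TIF flag.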

== Proper entry code ==

We must make sure that a given arch never calls exception_enter() /
exception_exit().
These save the previous context tracking state and temporarily switch to
kernel mode (from the context tracking POV). Since this state is saved
on the stack, it prevents us from turning off context tracking entirely
on a CPU: the tracking must be done on all CPUs, and that takes some
cycles.

This means that, considering early entry code (before the call to
context tracking upon kernel entry, and after the call to context
tracking upon kernel exit), we must take care of a few things:

1) Make sure early entry code can't trigger exceptions, or, if it does,
that the given exception can't schedule or use RCU (unless it calls
rcu_nmi_enter()). Otherwise the exception would have to call
exception_enter()/exception_exit(), which we don't want.

2) No call to schedule_user().

3) Make sure early entry code is not interruptible, or
preempt_schedule_irq() would rely on
exception_enter()/exception_exit().

4) Make sure early entry code can't be traced (no call to
preempt_schedule_notrace()), or, if it is traced, that it can't
schedule.

I believe x86 does most of that well. In the end we should remove the
exception_enter()/exception_exit() implementations in x86 and replace
them with a check that makes sure the context tracking state is not in
USER. An arch meeting all the above conditions would earn
CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING, and being able to toggle
nohz_full at runtime would depend on that.
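A minimal sketch of what such a check could look like, with made-up
names and a plain enum standing in for the real per-CPU context
tracking state: instead of saving and restoring state, the exception
entry merely asserts it never runs while tracking still thinks we are
in user mode.

```c
#include <assert.h>

/* Illustrative states; the kernel's context tracking has its own. */
enum ctx_state { CONTEXT_KERNEL, CONTEXT_USER };

static enum ctx_state cpu_ctx_state = CONTEXT_KERNEL;

/* Hypothetical replacement for exception_enter(): no state saved on
 * the stack, just a sanity check that early entry code already exited
 * user mode before any exception could fire. */
static int exception_entry_sane(void)
{
	return cpu_ctx_state != CONTEXT_USER;	/* 1 = sane, 0 = bug */
}
```

Since nothing is saved, a CPU that has context tracking disabled pays
nothing here, which is exactly what runtime toggling needs.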


== Cputime accounting ==

Both the write and read sides must switch to tick-based accounting and
drop the use of the seqlock in task_cputime(), task_gtime(),
kcpustat_field() and kcpustat_cpu_fetch(). A special ordering/state
machine is required to do that without races.
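For reference, the read side these functions use today is roughly the
classic seqcount retry loop, sketched below in simplified
single-threaded userspace C (struct fields and function names are
illustrative, not the kernel's); it is this loop, and the write-side
seqcount updates it pairs with, that must go away:

```c
#include <assert.h>

/* Simplified stand-in for the vtime seqcount protecting cputime. */
struct vtime_sketch {
	unsigned int seq;		/* odd while a writer is in progress */
	unsigned long long utime;
	unsigned long long stime;
};

static unsigned long long read_cputime(struct vtime_sketch *vt)
{
	unsigned int seq;
	unsigned long long sum;

	do {
		seq = vt->seq;			/* read_seqcount_begin() */
		sum = vt->utime + vt->stime;
	} while ((seq & 1) || seq != vt->seq);	/* read_seqcount_retry() */

	return sum;
}
```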

== Nohz ==

Switch from nohz_full to nohz_idle. Mind a few details:
    
    1) Turn off the remaining 1Hz tick offloaded to housekeeping CPUs.
    2) Handle tick dependencies and take care of racing CPUs
    setting/clearing a tick dependency. It's much trickier when
    we switch from nohz_idle to nohz_full.
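A rough sketch of the kind of atomic dependency mask involved, using
C11 atomics in place of the kernel's atomic ops (bit and function names
are illustrative): racing CPUs set and clear bits atomically, and the
tick may only be stopped while the mask reads zero.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative dependency bits; the kernel defines its own set. */
enum tick_dep_bits {
	TICK_DEP_BIT_SCHED = 0,
	TICK_DEP_BIT_PERF  = 1,
};

static atomic_uint tick_dep_mask;

static void tick_dep_set(enum tick_dep_bits bit)
{
	atomic_fetch_or(&tick_dep_mask, 1u << bit);
}

static void tick_dep_clear(enum tick_dep_bits bit)
{
	atomic_fetch_and(&tick_dep_mask, ~(1u << bit));
}

static int can_stop_tick(void)
{
	return atomic_load(&tick_dep_mask) == 0;
}
```

The atomic read-modify-write ops are what keep concurrent set/clear
from losing updates; the hard part in the nohz_idle -> nohz_full
direction is re-evaluating the mask on CPUs whose tick is already
stopped.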
    
== Unbound affinity ==

Restore the wide affinity of kernel threads, workqueues, timers, etc.
But take care of cpumasks that have been set through other interfaces:
sysfs, procfs, etc.
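A toy sketch of the intended policy, with made-up fields and a plain
bitmask standing in for a cpumask: when nohz_full is turned off, widen
the mask back to all CPUs only when the user didn't pin it explicitly
through sysfs/procfs.

```c
#include <assert.h>
#include <stdbool.h>

#define ALL_CPUS 0xffu	/* illustrative "every CPU" mask */

struct kthread_sketch {
	unsigned int cpumask;	/* bitmask of allowed CPUs */
	bool user_set;		/* affinity was set via sysfs/procfs */
};

/* Widen affinity back to all CPUs unless the user pinned it. */
static void restore_wide_affinity(struct kthread_sketch *t)
{
	if (!t->user_set)
		t->cpumask = ALL_CPUS;
}
```

The point of the user_set flag is to remember where a mask came from:
affinity narrowed automatically for isolation can be undone, while an
administrator's explicit choice must survive the toggle.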
