On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
Detailed problem statement and some of the implementation choices were
discussed earlier[1].

[1]: https://lore.kernel.org/all/[email protected]/
This is likely the version that will be used for the LPC2025 discussion
on this topic. Please feel free to provide suggestions; hoping for a
solution that works for different architectures and their use cases.
All the existing alternatives, such as CPU hotplug or creating isolated
partitions, break user affinity. Since the number of CPUs to use changes
depending on the steal time, it is not driven by the user, so it would be
wrong to break the affinity. With this series, if a task is pinned only
to paravirt CPUs, it continues running there.
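To make the intent concrete, here is a rough, untested sketch (not code
from the series; the function name and the scratch mask argument are made
up) of how CPU selection can skip paravirt CPUs without touching the
user-set affinity:

static int sketch_select_cpu(struct task_struct *p, struct cpumask *scratch)
{
        /*
         * If the task is affine to at least one non-paravirt CPU,
         * restrict the search to those CPUs. cpumask_andnot() returns
         * false when the result is empty, i.e. the task is pinned only
         * to paravirt CPUs; in that case leave it where it is.
         */
        if (cpumask_andnot(scratch, p->cpus_ptr, cpu_paravirt_mask))
                return cpumask_any(scratch);

        return cpumask_any(p->cpus_ptr);
}

The key point is that p->cpus_ptr is only read; the exclusion happens in
the search mask, so the affinity set by the user survives.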
Changes compared to v3[1]:
- Introduced computation of steal time in powerpc code.
- Derive the number of CPUs to use and mark the remaining as paravirt
  based on steal values (see the sketch after this list).
- Provide debugfs knobs to alter how the steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by enabling the tick on such a CPU when
  there is CFS/RT on it.
- Updated helper patch to override arch behaviour for easier debugging
  during development.
- Kept
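As a rough illustration of how steal values could drive the number of
CPUs to use (this is not the policy implemented in the powerpc patches;
the helper name, the clamping and the rounding are made up for this
sketch):

static unsigned int sketch_cpus_to_use(u64 steal_delta, u64 wall_delta,
                                       unsigned int nr_vcpus)
{
        /*
         * steal_delta: steal time accumulated over the sampling
         * interval, summed across all vCPUs; wall_delta: interval
         * length multiplied by nr_vcpus.
         */
        u64 steal_pct = div64_u64(steal_delta * 100, wall_delta);
        unsigned int use;

        steal_pct = min_t(u64, steal_pct, 100);
        /* Use the non-stolen share of vCPUs, keeping at least one. */
        use = nr_vcpus * (100 - steal_pct) / 100;
        return max(use, 1U);
}

The remaining nr_vcpus - use CPUs would then be marked paravirt; the
debugfs knobs mentioned above could shift how the steal values feed into
this decision.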
Changes compared to v4[2]:
- Last two patches were sent out separately instead of being part of the
  series, which created confusion. Those two patches are debug patches
  one can use to check functionality across architectures. Sorry about
  that.
- Use DEVICE_ATTR_RW instead (greg)
- Made it a PATCH since the arch specific handling completes the
  functionality.

[2]: https://lore.kernel.org/all/[email protected]/
TODO:
- Get performance numbers on PowerPC, x86 and S390, hopefully by next
  week. Didn't want to hold the series till then.
- The way CPUs are marked as paravirt is very simple and doesn't work
  when vCPUs aren't spread out uniformly across NUMA nodes. Ideally the
  count would be split based on how many CPUs each NUMA node has. It is
  quite tricky to do, especially since a cpumask can be on the stack
  too, given NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got
  my head into solving it yet; maybe there is an easier way. (A rough
  sketch follows after this list.)
- DLPAR Add/Remove needs to call init of EC/VP cores (powerpc specific).
- Userspace tools awareness such as irqbalance.
- Delve into the design of a hint from the hypervisor (HW hint), i.e.
  the host informs the guest which/how many CPUs it has to use at this
  moment. This interface should work across archs with each arch doing
  its specific handling.
- Determine the default values for the steal time related knobs
  empirically and document them.
- Need to check safety against CPU hotplug, especially in process_steal.
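The per-node split mentioned in the TODO above could look roughly like
this (an untested sketch, not part of the series: the function name,
nr_to_mark and the result mask are made up, the rounding remainder is
not redistributed, and picking simply the first CPUs of each node is
another open policy question). The result mask would then be applied via
whatever marking interface the series provides, and allocating the
scratch mask avoids putting a potentially 8192-bit cpumask on the stack:

static void sketch_pick_paravirt_cpus(unsigned int nr_to_mark,
                                      struct cpumask *result)
{
        cpumask_var_t node_cpus;
        unsigned int node, cpu, quota, marked;

        /* Heap allocation instead of a large on-stack cpumask. */
        if (!alloc_cpumask_var(&node_cpus, GFP_KERNEL))
                return;

        for_each_node_state(node, N_CPU) {
                cpumask_and(node_cpus, cpumask_of_node(node),
                            cpu_online_mask);
                /* Each node contributes in proportion to its CPU count. */
                quota = nr_to_mark * cpumask_weight(node_cpus) /
                        num_online_cpus();
                marked = 0;
                for_each_cpu(cpu, node_cpus) {
                        if (marked++ >= quota)
                                break;
                        cpumask_set_cpu(cpu, result);
                }
        }

        free_cpumask_var(node_cpus);
}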
Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b

Thanks to Srikar for providing the initial code around powerpc steal
time handling. Thanks to all who went through the series and provided
reviews.
PS: I haven't found a better name. Please suggest if you have any.
Shrikanth Hegde (17):
sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
cpumask: Introduce cpu_paravirt_mask
sched/core: Dont allow to use CPU marked as paravirt
sched/debug: Remove unused schedstats
sched/fair: Add paravirt movements for proc sched file
sched/fair: Pass current cpu in select_idle_sibling
sched/fair: Don't consider paravirt CPUs for wakeup and load
balance
sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
task
sched/core: Add support for nohz_full CPUs
sched/core: Push current task from paravirt CPU
sysfs: Add paravirt CPU file
powerpc: method to initialize ec and vp cores
powerpc: enable/disable paravirt CPUs based on steal time
powerpc: process steal values at fixed intervals
powerpc: add debugfs file for controlling handling on steal values
sysfs: Provide write method for paravirt
sysfs: disable arch handling if paravirt file being written
.../ABI/testing/sysfs-devices-system-cpu | 9 +
Documentation/scheduler/sched-arch.rst | 37 +++
arch/powerpc/include/asm/smp.h | 1 +
arch/powerpc/kernel/smp.c | 1 +
arch/powerpc/platforms/pseries/lpar.c | 223 ++++++++++++++++++
arch/powerpc/platforms/pseries/pseries.h | 1 +
drivers/base/cpu.c | 59 +++++
include/linux/cpumask.h | 20 ++
include/linux/sched.h | 9 +-
kernel/sched/core.c | 106 ++++++++-
kernel/sched/debug.c | 5 +-
kernel/sched/fair.c | 42 +++-
kernel/sched/rt.c | 11 +-
kernel/sched/sched.h | 9 +
14 files changed, 519 insertions(+), 14 deletions(-)
The capability to temporarily exclude CPUs from scheduling might be
beneficial for s390x, where users often run Linux under a proprietary
hypervisor called PR/SM with high overcommit. In these circumstances
virtual CPUs may not be scheduled by the hypervisor for a very long
time.
Today we have an upstream feature called "Hiperdispatch", which
determines that this is about to happen and uses Capacity Aware
Scheduling to prevent processes from being placed on the affected CPUs.
However, at least when used for this purpose, Capacity Aware Scheduling
is best effort and fails to move tasks away from the affected CPUs
under high load.
Therefore I have decided to smoke test this series.
For the purposes of smoke testing, I set up a number of KVM virtual
machines and start the same benchmark inside each one. Then I collect
and compare the aggregate throughput numbers. I have not done testing
with PR/SM yet, but I plan to do this and report back. I also have not
tested this with VMs that are not 100% utilized yet.
Best results would be when it works as a HW hint from the hypervisor.
Benchmark parameters:
$ sysbench cpu run --threads=$(nproc) --time=10
$ schbench -r 10 --json --no-locking
$ hackbench --groups 10 --process --loops 5000
$ pgbench -h $WORKDIR --client=$(nproc) --time=10
Figures:
s390x (16 host CPUs):
Benchmark #VMs #CPUs/VM ΔRPS (%)
----------- ------ ---------- ----------
hackbench 16 4 60.58%
pgbench 16 4 50.01%
hackbench 8 8 46.18%
hackbench 4 8 43.54%
hackbench 2 16 43.23%
hackbench 12 4 42.92%
hackbench 8 4 35.53%
hackbench 4 16 30.98%
pgbench 12 4 18.41%
hackbench 2 24 7.32%
pgbench 8 4 6.84%
pgbench 2 24 3.38%
pgbench 2 16 3.02%
pgbench 4 16 2.08%
hackbench 2 32 1.46%
pgbench 4 8 1.30%
schbench 2 16 0.72%
schbench 4 8 -0.09%
schbench 4 4 -0.20%
schbench 8 8 -0.41%
sysbench 8 4 -0.46%
sysbench 4 8 -0.53%
schbench 8 4 -0.65%
sysbench 2 16 -0.76%
schbench 2 8 -0.77%
sysbench 8 8 -1.72%
schbench 2 24 -1.98%
schbench 12 4 -2.03%
sysbench 12 4 -2.13%
pgbench 2 32 -3.15%
sysbench 16 4 -3.17%
schbench 16 4 -3.50%
sysbench 2 8 -4.01%
pgbench 8 8 -4.10%
schbench 4 16 -5.93%
sysbench 4 4 -5.94%
pgbench 2 4 -6.40%
hackbench 2 8 -10.04%
hackbench 4 4 -10.91%
pgbench 4 4 -11.05%
sysbench 2 24 -13.07%
sysbench 4 16 -13.59%
hackbench 2 4 -13.96%
pgbench 2 8 -16.16%
schbench 2 4 -24.14%
schbench 2 32 -24.25%
sysbench 2 4 -24.98%
sysbench 2 32 -32.84%
x86_64 (32 host CPUs):
Benchmark #VMs #CPUs/VM ΔRPS (%)
----------- ------ ---------- ----------
hackbench 4 32 87.02%
hackbench 8 16 48.45%
hackbench 4 24 47.95%
hackbench 2 8 42.74%
hackbench 2 32 34.90%
pgbench 16 8 27.87%
pgbench 12 8 25.17%
hackbench 8 8 24.92%
hackbench 16 8 22.41%
hackbench 16 4 20.83%
pgbench 8 16 20.40%
hackbench 12 8 20.37%
hackbench 4 16 20.36%
pgbench 16 4 16.60%
pgbench 8 8 14.92%
hackbench 12 4 14.49%
pgbench 4 32 9.49%
pgbench 2 32 7.26%
hackbench 2 24 6.54%
pgbench 4 4 4.67%
pgbench 8 4 3.24%
pgbench 12 4 2.66%
hackbench 4 8 2.53%
pgbench 4 8 1.96%
hackbench 2 16 1.93%
schbench 4 32 1.24%
pgbench 2 8 0.82%
schbench 4 4 0.69%
schbench 2 32 0.44%
schbench 2 16 0.25%
schbench 12 8 -0.02%
sysbench 2 4 -0.02%
schbench 4 24 -0.12%
sysbench 2 16 -0.17%
schbench 12 4 -0.18%
schbench 2 4 -0.19%
sysbench 4 8 -0.23%
schbench 8 4 -0.24%
sysbench 2 8 -0.24%
schbench 4 8 -0.28%
sysbench 8 4 -0.30%
schbench 4 16 -0.37%
schbench 2 24 -0.39%
schbench 8 16 -0.49%
schbench 2 8 -0.67%
pgbench 4 16 -0.68%
schbench 8 8 -0.83%
sysbench 4 4 -0.92%
schbench 16 4 -0.94%
sysbench 12 4 -0.98%
sysbench 8 16 -1.52%
sysbench 16 4 -1.57%
pgbench 2 4 -1.62%
sysbench 12 8 -1.69%
schbench 16 8 -1.97%
sysbench 8 8 -2.08%
hackbench 8 4 -2.11%
pgbench 4 24 -3.20%
pgbench 2 24 -3.35%
sysbench 2 24 -3.81%
pgbench 2 16 -4.55%
sysbench 4 16 -5.10%
sysbench 16 8 -6.56%
sysbench 2 32 -8.24%
sysbench 4 32 -13.54%
sysbench 4 24 -13.62%
hackbench 2 4 -15.40%
hackbench 4 4 -17.71%
There are some huge wins, especially for hackbench, which corresponds
to Shrikanth's findings. There are some significant degradations too,
which I plan to debug. This may simply have to do with the simplistic
heuristic I am using for testing [1].
Thank you very much for running these numbers!
sysbench, for example, is not supposed to benefit from this series,
because it is not affected by overcommit. However, it definitely should
not degrade by 30%. Interestingly enough, this happens only with
certain combinations of VM and CPU counts, and this is reproducible.
Is the host bare metal? In those cases the cpufreq governor ramping up
or down might play a role (speculating).
Initially I saw degradations as bad as -80% with schbench. It turned out
this was caused by the userspace per-CPU locking it implements; turning
it off made the degradation go away. To me this looks like something
synthetic and not something used by real-world applications, but please
correct me if I am wrong - then this will have to be resolved.
That's nice to hear. I was concerned about the schbench RPS; now I am a
bit relieved. Is this with the schbench -L option? I ran with it, and
the regression I was seeing earlier is gone now.
One note regarding the PARAVIRT Kconfig gating: s390x does not select
PARAVIRT today. Steal time, for example, is determined based on CPU
timers and clocks, not hypervisor hints. For now I had to add dummy
paravirt headers to test this series, but I would appreciate it if the
Kconfig gating was removed.
Keeping the PARAVIRT checks is probably the right thing. I will wait to
see if anyone objects.
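For reference, one way the gating could be kept while still letting
non-PARAVIRT architectures such as s390x build without local dummy
headers would be a stub fallback in the common header. This is only a
sketch, not what the series does; the __cpu_paravirt_mask name is
assumed here, mirroring how __cpu_online_mask is wrapped:

#ifdef CONFIG_PARAVIRT
extern struct cpumask __cpu_paravirt_mask;
#define cpu_paravirt_mask ((const struct cpumask *)&__cpu_paravirt_mask)
#else
/* No PARAVIRT: the mask stays empty and all checks fall through. */
#define cpu_paravirt_mask cpu_none_mask
#endif

With such a fallback the scheduler checks compile everywhere and become
no-ops wherever the mask stays empty.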
Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous "cpu-
avoid" one was perfect, but it was much better.
[1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
Will look into it. One thing to be careful about is the CPU numbers.