On 11/19/25 6:14 PM, Shrikanth Hegde wrote:
A detailed problem statement and some of the implementation choices were
discussed earlier[1].


Performance data on x86 and PowerPC:

++++++++++++++++++++++++++++++++++++++++++++++++
PowerPC: LPAR(VM) Running on powerVM hypervisor
++++++++++++++++++++++++++++++++++++++++++++++++

Host: 126 cores available in pool.
VM1: 96VP/64EC - 768 CPUs
VM2: 72VP/48EC - 576 CPUs
(VP- Virtual Processor core), (EC - Entitled Cores)
steal_check_frequency:1
steal_ratio_high:400
steal_ratio_low:150
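
The tunables above were applied through the interfaces added by the series.
As a minimal sketch, assuming they are exposed as single-value sysfs files
(the paths below are hypothetical placeholders, not the series' actual
attribute names):

#include <stdio.h>

/* Write a single integer to a sysfs-style tunable file. */
static int write_tunable(const char *path, int val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", val);
        return fclose(f);
}

int main(void)
{
        /* Hypothetical paths; substitute the files the series adds. */
        write_tunable("/sys/kernel/paravirt/steal_check_frequency", 1);
        write_tunable("/sys/kernel/paravirt/steal_ratio_high", 400);
        write_tunable("/sys/kernel/paravirt/steal_ratio_low", 150);
        return 0;
}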

Scenarios:
Scenario 1: (Major improvement)
VM1 is running daytrader[2] and VM2 is running stress-ng --cpu=$(nproc)
Note: High gains. Upstream, the steal time was around 15%; with the series
it comes down to 3%. With further tuning it could be reduced further (a
sketch for measuring steal time follows the table below).

                                upstream                +series
daytrader throughput            1x                      1.7x     <<- 70% gain
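
For reference, steal-time figures like the 15%/3% quoted above can be
sampled from the aggregate "cpu" line of /proc/stat, whose eighth value is
steal time. A minimal sketch in C (the 5-second sampling interval is an
arbitrary choice):

#include <stdio.h>
#include <unistd.h>

/* Read the aggregate "cpu" line of /proc/stat into v[0..9]:
 * user nice system idle iowait irq softirq steal guest guest_nice */
static int read_cpu_line(unsigned long long v[10])
{
        FILE *f = fopen("/proc/stat", "r");
        int n;

        if (!f)
                return -1;
        n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8], &v[9]);
        fclose(f);
        return n == 10 ? 0 : -1;
}

int main(void)
{
        unsigned long long a[10], b[10], total = 0;
        int i;

        if (read_cpu_line(a))
                return 1;
        sleep(5);                       /* sampling interval */
        if (read_cpu_line(b))
                return 1;
        for (i = 0; i < 10; i++)
                total += b[i] - a[i];
        /* v[7] is steal: time stolen by the hypervisor. */
        printf("steal: %.1f%%\n", 100.0 * (b[7] - a[7]) / total);
        return 0;
}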

-----------
Scenario 2: (improves when thread_count < num_cpus)
VM1 is running schbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are the average of 5 runs; latencies shown are wakeup latencies (usec)

schbench -t 400                 upstream                +series
50.0th:                           18.00                   16.60
90.0th:                          174.00                   46.80
99.0th:                         3197.60                  928.80
99.9th:                         6203.20                 4539.20
average rps:                   39665.61                42334.65
schbench -t 600                 upstream                +series
50.0th:                           23.80                   19.80
90.0th:                          917.20                  439.00
99.0th:                         5582.40                 3869.60
99.9th:                         8982.40                 6574.40
average rps:                   39541.00                40018.11

-----------
Scenario 3: (Improves)
VM1 is running hackbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are the average of 10 runs with 20000 loops (seconds, lower is better).

hackbench                       upstream           +series
Process 10 groups                 2.84               2.62
Process 20 groups                 5.39               4.48
Process 30 groups                 7.51               6.29
Process 40 groups                 9.88               7.42
Process 50 groups                12.46               9.54
Process 60 groups                14.76              12.09
thread  10 groups                 2.93               2.70
thread  20 groups                 5.79               4.78
Process(Pipe) 10 groups           2.31               2.18
Process(Pipe) 20 groups           3.32               3.26
Process(Pipe) 30 groups           4.19               4.14
Process(Pipe) 40 groups           5.18               5.53
Process(Pipe) 50 groups           6.57               6.80
Process(Pipe) 60 groups           8.21               8.13
thread(Pipe)  10 groups           2.42               2.24
thread(Pipe)  20 groups           3.62               3.42

-----------
Notes:

Numbers might be very favorable since VM2 is constantly running and, when
there is steal time, has some of its CPUs marked as paravirt; the thresholds
might also have played a role. The plan is to run the same workloads
(i.e. hackbench and schbench) on both VMs and observe the behavior.

VM1 has its CPUs distributed equally across nodes, while VM2 does not. Since
CPUs are marked paravirt based on core count, some nodes on VM2 would have
been left unused, and that could have added a boost to VM1 performance,
especially for daytrader.

[2]: Daytrader is a real-life benchmark that simulates stock trading.
https://www.ibm.com/docs/en/linux-on-systems?topic=descriptions-daytrader-benchmark-application
https://cwiki.apache.org/confluence/display/GMOxDOC12/Daytrader

TODO: Get numbers with very high concurrency of hackbench/schbench.

+++++++++++++++++++++++++++++++
on x86_64 (Laptop running KVMs)
+++++++++++++++++++++++++++++++
Host: 8 CPUs.
Two VMs, each spawned with -smp 8.
-----------
Scenario 1:
Both VMs are running hackbench with 10 process groups and 10000 loops.
Values are the average of 3 runs. High steal time, close to 50%, was seen
when running upstream, so CPUs 4-7 were marked as paravirt by writing to the
sysfs file (a sketch of this follows below). Since the laptop has a lot of
host tasks running, there will still be some steal time.
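
As a minimal sketch of that marking step, assuming the series exposes a
cpulist-style sysfs attribute (the path below is a hypothetical placeholder
for the actual file the series adds):

#include <stdio.h>

int main(void)
{
        /* Hypothetical path; use the attribute added by the series. */
        FILE *f = fopen("/sys/devices/system/cpu/paravirt", "w");

        if (!f)
                return 1;
        fprintf(f, "4-7\n");    /* cpulist format, like other cpu masks */
        return fclose(f) ? 1 : 0;
}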

hackbench 10 groups             upstream                +series (4-7 paravirt)
(seconds)                         58                       54.42

Note: Having 5 groups helps too, but when concurrency gets very high
(40 groups), it regresses.

-----------
Scenario 2:
Both VMs are running schbench. Values are the average of 2 runs.
"schbench -t 4 -r 30 -i 30" (latencies improve but rps is slightly lower)

wakeup latencies (usec)         upstream                +series (4-7 paravirt)
50.0th                            25.5                          13.5
90.0th                            70.0                          30.0
99.0th                          2588.0                        1992.0
99.9th                          3844.0                        6032.0
average rps:                       338                          326

schbench -t 8 -r 30 -i 30    (Major degradation of rps)
wakeup latencies (usec)         upstream                +series (4-7 paravirt)
50.0th                            15.0                          11.5
90.0th                          1630.0                        2844.0
99.0th                          4314.0                        6624.0
99.9th                          8572.0                       10896.0
average rps:                       393                         240.5

Anything higher also regresses; need to investigate why. It may be too many
context switches, since the number of threads is high relative to the CPUs
available.
