Author: Matthew Dillon <dil...@apollo.backplane.com>
Date: Sat Aug 12 10:26:17 2017 -0700
kernel - Fix bottlenecks that develop when many processes are running
* When a large number of processes or threads are running (in the tens of
thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop.
These bottlenecks do not develop when only a few thousand threads
are present.
By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured
or manually set high enough, DFly can now handle hundreds of thousands
of active processes running, polling, sleeping, whatever.
Tested to around 400,000 discrete processes (no shared VM pages) on
a 32-thread dual-socket Xeon system. Each process is placed in a
1/10 second sleep loop using umtx timeouts:
    baseline          - (before changes) System bottlenecked starting
                        at around the 30,000 process mark, eating all
                        available cpu, with a high IPI rate from hash
                        collisions, and other unrelated user processes
                        bogged down due to the scheduling overhead.
    200,000 processes - System settles down to 45% idle, and low IPI
                        rate.
    220,000 processes - System 30% idle and low IPI rate.
    250,000 processes - System 0% idle and low IPI rate.
    300,000 processes - System 0% idle and low IPI rate.
    400,000 processes - Scheduler begins to bottleneck again after the
                        350,000 mark while the process test is still in its
                        Once all 400,000 processes are settled down,
                        system behaves fairly well. 0% idle, modest
                        IPI rate averaging 300 IPI/sec/cpu (due to
                        hash collisions in the wakeup code).
* More work will be needed to better handle processes with massively
shared VM pages.
It should also be noted that the system does a *VERY* good job
allocating and releasing kernel resources during this test using
discrete processes. It can kill 400,000 processes in a few seconds
when I ^C the test.
* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan.
This bottleneck does not arise when large numbers of processes are
running in usermode, because typically only one user process per cpu
will be scheduled to LWKT.
However, this bottleneck does arise when large numbers of threads
are woken up in-kernel. While in-kernel, a thread schedules directly
to LWKT. Round-robin operation tends to result in appends to the tail
of the queue, so this optimization saves an enormous amount of cpu
time when large numbers of threads are present.
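A rough illustration of the double-ended scan idea, not the actual
lwkt_enqueue() code. The toy_* types and names are hypothetical, and the
sketch assumes a single priority field where higher values run first and
equal priorities are kept in FIFO order. The forward and backward cursors
advance together, so an append at the tail (the common round-robin case)
finishes without walking the queue:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel thread; not the real struct thread. */
struct toy_thread {
	int pri;			/* higher value runs first */
	struct toy_thread *prev;
	struct toy_thread *next;
};

struct toy_runq {
	struct toy_thread *head;
	struct toy_thread *tail;
};

/* Link td immediately before pos; pos == NULL appends at the tail. */
static void
toy_insert_before(struct toy_runq *q, struct toy_thread *pos,
		  struct toy_thread *td)
{
	td->next = pos;
	td->prev = pos ? pos->prev : q->tail;
	if (td->prev)
		td->prev->next = td;
	else
		q->head = td;
	if (pos)
		pos->prev = td;
	else
		q->tail = td;
}

/*
 * Insert td so the queue stays sorted by descending priority.
 * Returns the number of cursor advances, for demonstration only.
 */
static int
toy_enqueue(struct toy_runq *q, struct toy_thread *td)
{
	struct toy_thread *fwd = q->head;
	struct toy_thread *back = q->tail;
	int steps = 0;

	for (;;) {
		/* Forward cursor: first entry with lower priority. */
		if (fwd == NULL || fwd->pri < td->pri) {
			toy_insert_before(q, fwd, td);
			break;
		}
		/* Backward cursor: last entry with >= priority. */
		assert(back != NULL);
		if (back->pri >= td->pri) {
			toy_insert_before(q, back->next, td);
			break;
		}
		fwd = fwd->next;
		back = back->prev;
		++steps;
	}
	return steps;
}
```

Both cursors converge on the same insertion point, so the cost is
bounded by the distance to the nearer end of the queue rather than by
the queue length.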
* Limit ncallout to ~5 minutes worth of ring. The calculation code is
primarily designed to allocate less space on low-memory machines,
but will also cause an excessively-sized ring to be allocated on
large-memory machines. 512MB was observed on a 32-way box.
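The clamp can be pictured roughly as follows. This is a sketch under
assumptions: toy_limit_ncallout() is a hypothetical helper, and it
assumes one ring slot per clock tick so that five minutes corresponds
to 5 * 60 * hz entries. The real calculation in subr_param.c may differ:

```c
#include <assert.h>

/*
 * Hypothetical helper: cap the requested callout ring size at roughly
 * five minutes worth of ticks.  Assumes one ring slot per tick.
 */
static int
toy_limit_ncallout(int ncallout, int hz)
{
	int limit;

	assert(hz > 0);
	limit = 5 * 60 * hz;
	return (ncallout > limit) ? limit : ncallout;
}
```

With hz = 100 the cap works out to 30,000 entries, and smaller requests
pass through unchanged.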
* Remove vm_map->hint, which had basically stopped functioning in a
useful manner. Add a new vm_map hinting mechanism that caches up to
four (size, align) start addresses for vm_map_findspace(). This cache
is used to quickly index into the linear vm_map_entry list before
entering the linear search phase.
This fixes a serious bottleneck that arises due to vm_map_findspace()'s
linear scan of the vm_map_entry list when the kernel_map becomes
fragmented, typically when the machine is managing a large number of
processes or threads (in the tens of thousands or more).
This will also reduce overheads for processes with highly fragmented
vm_maps.
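A minimal sketch of a four-entry (size, align) hint cache, to make the
idea concrete. All toy_* names are hypothetical, the round-robin
replacement policy is an assumption, and a size of zero is used here to
mark an empty slot. A hint is only a starting point: the caller is still
expected to run its normal search from the returned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NHINTS	4

struct toy_hint {
	size_t size;		/* 0 marks an unused slot */
	size_t align;
	uintptr_t start;	/* where the last search for this key ended */
};

struct toy_hint_cache {
	struct toy_hint h[TOY_NHINTS];
	unsigned next;		/* round-robin replacement index */
};

/* Return the cached start for (size, align), or fallback on a miss. */
static uintptr_t
toy_hint_lookup(const struct toy_hint_cache *c, size_t size, size_t align,
		uintptr_t fallback)
{
	unsigned i;

	for (i = 0; i < TOY_NHINTS; ++i) {
		if (c->h[i].size != 0 &&
		    c->h[i].size == size && c->h[i].align == align)
			return c->h[i].start;
	}
	return fallback;
}

/* Record a start address, reusing a matching slot when one exists. */
static void
toy_hint_set(struct toy_hint_cache *c, size_t size, size_t align,
	     uintptr_t start)
{
	unsigned i;

	assert(size != 0);
	for (i = 0; i < TOY_NHINTS; ++i) {
		if (c->h[i].size == size && c->h[i].align == align) {
			c->h[i].start = start;
			return;
		}
	}
	i = c->next;
	c->next = (c->next + 1) % TOY_NHINTS;
	c->h[i].size = size;
	c->h[i].align = align;
	c->h[i].start = start;
}
```

Keying on (size, align) means repeated allocations of the same shape,
such as per-thread kernel stacks, can skip the part of the list that a
previous search already walked.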
* Dynamically size the action_hash array in vm/vm_page.c. This array
is used to record blocked umtx operations. The limited size of the
array could result in an excessive number of hash entries when a large
number of processes/threads are present in the system. Again, the
effect is noticed as the number of threads exceeds a few tens of
thousands.
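One way to picture dynamic sizing is a power-of-two bucket count scaled
to the expected load. This is only a sketch: the commit text does not
say what the array is sized from, so the nthreads input, the toy_*
names, and the shift-and-mask hash below are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sizing rule: the smallest power of two that is at least
 * nthreads, but never below min_buckets (itself a power of two).
 */
static size_t
toy_hash_buckets(size_t nthreads, size_t min_buckets)
{
	size_t n = min_buckets;

	assert(min_buckets != 0 && (min_buckets & (min_buckets - 1)) == 0);
	while (n < nthreads && n <= SIZE_MAX / 2)
		n <<= 1;
	return n;
}

/*
 * Hypothetical index function: drop the low bits of the key and mask.
 * Requires nbuckets to be a power of two.
 */
static size_t
toy_hash_index(uintptr_t key, size_t nbuckets)
{
	return (size_t)(key >> 6) & (nbuckets - 1);
}
```

With a fixed small table, 400,000 waiters must share a few buckets and
every wakeup scans long chains. Scaling the bucket count with the load
keeps the average chain length near one.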
Summary of changes:
 sys/kern/kern_synch.c  |   6 +-
 sys/kern/lwkt_thread.c |  25 ++++++-
 sys/kern/subr_param.c  |   7 ++
 sys/vm/vm_glue.c       |  10 +--
 sys/vm/vm_map.c        | 188 ++++++++++++++++++++++++++-----------------------
 sys/vm/vm_map.h        |  30 +++++++-
 sys/vm/vm_page.c       |  70 ++++++++++++------
 7 files changed, 213 insertions(+), 123 deletions(-)
DragonFly BSD source repository