Hi All,
With recent contributions to Hurd IRQ management I was finally able to
run GNU/Hurd on my vintage x86_64 hardware in order to stress test it
with stress-ng. I've been running similar tests on virtual machines for
the last 6 months or so and was interested in how stable they would be
on standalone hardware.
It was immediately obvious that swapping performance under intensive
paging was much worse than on the virtual machine. That in itself is
not surprising, but performance was so poor that system lockups (which
also occur on virtual machines) were almost immediate. In fact, I have
not been able to run this single 2-minute test case to completion
without the kernel ending in a 'system lock' (waiting on a page-in):
# stress-ng -t 2m --metrics --vm 32 --vm-bytes 1800M --mmap 32
--mmap-bytes 1800M --page-in
My machine has a traditional rotating disc and 4GB of RAM. Running the
above on a similarly sized virtual machine uses around 1.3G of swap and
succeeds approximately 90% or more of the time. My suspicion is that
the longer page-in times of a real disc (rather than a possibly cached
virtual disc) result in a greater likelihood of system lock. I
concluded that in order to make meaningful observations of this kind of
system behaviour on actual hardware, I needed to improve the swapping
performance.
The current page replacement policy in gnumach/vm_page is documented
within the source. It describes a policy of preferring to page out
external pages (mmap) over internal pages (anonymous memory) in order
to minimise use of the default pager, which is described as unreliable.
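In effect the documented behaviour amounts to something like the
following (a conceptual sketch only; the helper names are made up and
are not the real gnumach functions):

#include <stddef.h>

struct page;                                    /* opaque here */

/* Assumed helpers for the sketch, not actual gnumach functions. */
struct page *take_inactive_external_page(void);
struct page *take_inactive_internal_page(void);

struct page *current_policy_choose_victim(void)
{
    /* External (file-backed) pages are reclaimed first... */
    struct page *victim = take_inactive_external_page();

    /* ...and anonymous memory, which goes through the default pager,
       is only touched when no external page is available. */
    if (victim == NULL)
        victim = take_inactive_internal_page();

    return victim;
}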
I've been stress testing GNU/Hurd for quite some time now and have seen
many system freezes, but I do not recall any that were definitely
caused by the default pager. The most common underlying cause is a
request for an external page that cannot make progress, due either to a
deadlock elsewhere or to assertion failures within the ext2fs or
storeio servers.
I have recently spent some time developing alternative page replacement
implementations of varying complexity. One of the simplest of these
(referred to from here on as 'My_patch') results in very significant
performance improvements generally, and is sufficient to allow the
stress test case above to complete most of the time. Before I offer
this as a patch series, I'd like to present the performance
improvements it delivers and describe how it achieves them.
I've benchmarked the following 2 test cases:
1) SNG10
This is simply ten iterations of the 2-minute stress-ng test case shown
above. Whilst it is a good way to drive the system into a heavy paging
state, it doesn't really represent anything that would normally be run
on a machine.
2) TCM3
This is a test case closer to a normal workload. I looked for some
heavily templated C++ code that results in large compiler process
sizes. Specifically, I used MatrixSine.cpp, which is included as
example code in the libeigen package. Running 3 concurrent compilations
results in around 500M of swap usage on my 4GB test machines:
# /usr/bin/x86_64-gnu-g++-14 -I/usr/include/eigen3 -g -O2 -o
matrix_sine_1 MatrixSine.cpp &
# /usr/bin/x86_64-gnu-g++-14 -I/usr/include/eigen3 -g -O2 -o
matrix_sine_2 MatrixSine.cpp &
# /usr/bin/x86_64-gnu-g++-14 -I/usr/include/eigen3 -g -O2 -o
matrix_sine_3 MatrixSine.cpp &
These are the various machine configurations used with each test case:
1) VMHURD_REL: VM using Hurd (GNU-Mach 1.8+git20250731-8 amd64)
   4096M RAM (3610M post boot), 2.8G swap
2) VMHURD_PAT: VM using Hurd (GNU-Mach 1.8+git20250731-8 amd64 + 'My_patch')
   4096M RAM (3610M post boot), 2.8G swap
3) VMLINX: VM using Debian (6.12.48+deb13-amd64)
   3920M RAM (run with maxcpus=1 and has 3610M post boot), 4G swap
4) HWHURD_REL: Advent hardware using Hurd (GNU-Mach 1.8+git20250731-8 amd64)
   4096M RAM (3374M available after boot), 10G swap
5) HWHURD_PAT: Advent hardware using Hurd (GNU-Mach 1.8+git20250731-8
   amd64 + 'My_patch')
   4096M RAM (3374M available after boot), 10G swap
6) HWLINX: Advent hardware using Debian (6.12.48+deb13-amd64)
   4096M RAM (run with maxcpus=1 and has 3325M available after boot)
I have a number of my own local glibc, gnumach and hurd patches that fix
various bugs exposed by the stress tests but which I have not yet
submitted for merging. These do not affect swap performance.
These figures show averages for a number of runs of TCM3:
VMHURD_REL/TCM3: 11m12s (pagein=2225294, pageout=1972821)
VMHURD_PAT/TCM3:  3m07s (pagein= 179883, pageout= 281279)
HWHURD_REL/TCM3: Unable to complete any test case
HWHURD_PAT/TCM3:  8m59s (pagein= 256466, pageout= 373059)
HWLINX/TCM3:      2m12s (pagein=  66796, pageout= 236674)
The VMLINX times are significantly shorter than VMHURD_PAT, but due to
differences in virtual machine optimisations it doesn't seem meaningful
to report them. The figures above do show, however, that on hardware,
even with my patched kernel, Linux is around 4 times faster than
GNU/Hurd in this test case.
The stress-ng test case metrics give an indication of the number of mmap
and vm operations completed. Here are the averaged totals for a number
of test cases:
VMHURD_REL/SNG10: (mmap)  50.1, (vm) 206217
VMHURD_PAT/SNG10: (mmap) 327.1, (vm) 840666
HWHURD_REL/SNG10: Unable to complete any test run
HWHURD_PAT/SNG10: (mmap) 183,   (vm) 560018
TCM3 completes over 3 times faster with 'My_patch', and SNG10 completes
approximately 4 times as many stress-ng operations per iteration.
All 'My_patch' actually does is remove the restriction of always
prioritising external pages for eviction. Quite a few lines of code are
changed, but almost all of the changes are trivial. They result in 2
main behavioural differences (a rough sketch in code follows the list):
1) The vm_page code currently always attempts to find an external page
before looking for an internal one. I have changed a number of
functions so that they are told explicitly whether to choose external
or internal pages.
2) The vm_page code currently counts active and inactive pages, with
each count covering both external and internal pages. I have changed
the code to maintain separate counts for active_internal,
active_external, inactive_internal and inactive_external pages.
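As a rough sketch of both changes (the identifiers below are
illustrative, not the exact gnumach names):

#include <stddef.h>

struct page {
    struct page *next;
    int external;           /* 1 = file-backed (mmap), 0 = anonymous */
};

/* Change 2: separate counts per page type rather than combined
   active/inactive totals. */
static unsigned long active_external_count, active_internal_count;
static unsigned long inactive_external_count, inactive_internal_count;

static struct page *inactive_queue;      /* head of the inactive list */

/* Change 1: the caller states explicitly which type of page it wants,
   instead of the scan always preferring external pages. */
static struct page *pull_inactive_page(int want_external)
{
    struct page **link = &inactive_queue;

    for (struct page *p = *link; p != NULL; link = &p->next, p = *link) {
        if (p->external != want_external)
            continue;
        *link = p->next;                 /* unlink the chosen page */
        if (want_external)
            inactive_external_count--;
        else
            inactive_internal_count--;
        return p;
    }
    return NULL;                 /* no page of the requested type */
}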
The final patch in the series uses an extremely unsophisticated
algorithm to decide which type of page to evict next: it keeps choosing
external pages until they represent fewer than 1 in 25 of all (active
plus inactive) pages, at which point it chooses internal pages. I do
not propose this as a long-term strategy, simply as a starting point
for a more meaningful eviction policy. It is, frankly, ludicrously
simplistic, but nevertheless seemingly effective, at least for these
test cases.
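Continuing with the illustrative names from the sketch above (again,
this is just how I would express the threshold, not the literal patch
code), the decision boils down to:

/* Keep evicting external pages until they fall below 1 in 25 of all
   active + inactive pages, then switch to internal (anonymous)
   pages. */
static int choose_external_for_eviction(void)
{
    unsigned long external = active_external_count
                             + inactive_external_count;
    unsigned long total = external + active_internal_count
                          + inactive_internal_count;

    if (total == 0)
        return 0;

    /* external * 25 >= total  <=>  external/total >= 1/25 */
    return external * 25 >= total;
}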
There are many parts of the current implementation that negatively
affect performance. I have some speculative changes that reduce the
HWHURD_PAT/TCM3 time from the roughly 9m above to around 5m, but those
can be discussed later if appropriate.
I'd welcome feedback on whether 'My_patch' should be submitted for
consideration.
Regards,
Mike.