Hi,

On Mon, Mar 09, 2026 at 08:11:45PM +0000, Michael Kelly wrote:
> During sbuilds of haskell packages there are dependent packages installed
> that have a large installed size (ghc-doc for example is ~700M). Often
> during the write of this data, the system seems to enter a blocked
> state. Normal page allocation is suspended and so non-vm privileged tasks,
> including ext2fs servers, soon get blocked if they require more memory. Any
> process accessing file storage is also likely to block on pagein from the
> stalled servers so even the console becomes unresponsive.
>
> The system is not actually totally stuck. Pageout processing continues at a
> low level. There is no default pager running so only external pages can
> considered for pageout. Appropriate memory_object_data_return requests are
> issued to external pagers at the rate of approximately 100 per second. The
> CPU load is so low that the virtual machine 'CPU usage' graph superficially
> looks like it is zero. None of these m_o_d_r messages can be handled and
> actually free pages steadily decline.
>
> I added some debugging to log every 100th pageout attempt from when
> vm_page_alloc_paused becomes set. In one example, free pages steadily drop
> from ~67500 to about ~32000 over a period of ~22minutes. Then suddenly the
> pageout processing comes across a large series of pages (~38000) that can be
> trivially reclaimed which are sufficient to terminate the pageout activity
> and resume normal page allocation. The system becomes usable again.

Wow, cool. What exact patch did you use?

> Might it be that boralus is also behaving this way without it being noticed?
> The use of sync=5 might reduce the likelihood of this occurring, I'd guess,
> but I have also seen this scenario occur using sync=5 myself.

As a data point, the 64bit Postgres buildfarm animal VM I am running is
also running without mach-defpager and with sync=5. Normal operation is
pretty stable, but when I try to run the TAP tests (which create and
destroy Postgres server instances at a great frequency with lots of
I/O), it gets stuck pretty quickly as well.

I never had the patience to let it recover by itself (assuming it was
stuck for good), but I could try to reproduce it with your debugging
code added.


Michael
