Hi Michael,

The patch is very small, but I've attached it in case it is useful to you for debugging.

The diff relates to gnumach 2:1.8+git20260224-3.

The final hunk of the diff is an optional change that you can use to speed up this test case. Essentially it bypasses the 50ms wait between pageout scans. That wait is normally appropriate, but in the scenario that I described, where the ext2fs pagers are unresponsive, there is no benefit in waiting before doing another scan.

Regards,

Mike.

On 09/03/2026 20:35, Michael Banck wrote:
Hi,

On Mon, Mar 09, 2026 at 08:11:45PM +0000, Michael Kelly wrote:
During sbuilds of haskell packages there are dependent packages installed
that have a large installed size (ghc-doc for example is ~700M). Often
during the write of this data, the system seems to enter a blocked
state. Normal page allocation is suspended and so non-vm privileged tasks,
including ext2fs servers, soon get blocked if they require more memory. Any
process accessing file storage is also likely to block on pagein from the
stalled servers so even the console becomes unresponsive.

The system is not actually totally stuck. Pageout processing continues at a
low level. There is no default pager running so only external pages can be
considered for pageout. Appropriate memory_object_data_return requests are
issued to external pagers at the rate of approximately 100 per second. The
CPU load is so low that the virtual machine 'CPU usage' graph superficially
looks like it is zero. None of these m_o_d_r messages can be handled and
the number of free pages actually declines steadily.

I added some debugging to log every 100th pageout attempt from when
vm_page_alloc_paused becomes set. In one example, free pages steadily drop
from ~67500 to ~32000 over a period of ~22 minutes. Then suddenly the
pageout processing comes across a large series of pages (~38000) that can be
trivially reclaimed, which is sufficient to terminate the pageout activity
and resume normal page allocation. The system becomes usable again.
Wow, cool. What exact patch did you use?

Might it be that boralus is also behaving this way without it being noticed?
The use of sync=5 might reduce the likelihood of this occurring, I'd guess,
but I have also seen this scenario occur using sync=5 myself.
As a data point, the 64-bit Postgres buildfarm animal VM I am running is
also running without mach-defpager and with sync=5. Normal operation is
pretty stable, but when I try to run the TAP tests (which create and
destroy Postgres server instances at high frequency with lots of
I/O), it gets stuck pretty quickly as well. I never had the patience to
let it recover by itself (assuming it was stuck for good), but I could
try to reproduce it with your debugging code added.


Michael
--- vm_page.c.orig	2026-03-09 20:53:38.923249344 +0000
+++ vm_page.c	2026-03-09 20:58:05.596385851 +0000
@@ -49,6 +49,10 @@
 #include <vm/vm_page.h>
 #include <vm/vm_pageout.h>
 
+static int DBG_total_since_pause = 0;
+static int DBG_total_reclaimed_since_pause = 0;
+int DBG_debug = 1;
+
 #define DEBUG 0
 
 #define __init
@@ -381,6 +385,8 @@
 
         if ((seg->nr_free_pages <= seg->min_free_pages)
             && current_thread() && !current_thread()->vm_privilege) {
+	    if (!vm_page_alloc_paused)
+	      DBG_total_since_pause = DBG_total_reclaimed_since_pause = 0;
             vm_page_alloc_paused = TRUE;
             return NULL;
         }
@@ -410,6 +416,8 @@
     seg->nr_free_pages -= (1 << order);
 
     if (seg->nr_free_pages < seg->min_free_pages) {
+        if (!vm_page_alloc_paused)
+	  DBG_total_since_pause = DBG_total_reclaimed_since_pause = 0;
         vm_page_alloc_paused = TRUE;
     }
 
@@ -1212,6 +1220,8 @@
         return FALSE;
     }
 
+    DBG_total_since_pause++;
+
     if (reclaim) {
         vm_page_free(page);
         vm_page_unlock_queues();
@@ -1222,6 +1232,7 @@
             vm_object_unlock(object);
         }
 
+	DBG_total_reclaimed_since_pause++;
         return TRUE;
     }
 
@@ -1248,6 +1259,12 @@
         }
     }
 
+    if (DBG_debug && (DBG_total_since_pause % 100) == 0)
+      printf("Pageout: %d: %d: %ld\n",
+	     DBG_total_reclaimed_since_pause,
+	     DBG_total_since_pause,
+	     vm_page_mem_free());
+
     vm_pageout_page(page, FALSE, TRUE); /* flush it */
     vm_object_unlock(object);
 
@@ -2047,7 +2064,7 @@
     boolean_t pause, evicted, alloc_paused;
     unsigned int i;
 
-    *should_wait = TRUE;
+    *should_wait = FALSE;
 
     simple_lock(&vm_page_queue_free_lock);
     vm_page_external_laundry_count = 0;
