Thanks, I think I fixed it all now.

1) Removed all the unrelated whitespace changes.
2) Added blk_peek, carried across the outer loop (a small standalone sketch of
   this pattern follows right after the quoted review below, just before the
   patch).
3) Now not breaking on if (err || built == 0) but just on if (err).
4) Added the assert.
5) Added new numbers in the commit message (thanks for pointing that out).

Is there anything else that needs changes here?

Thanks in advance,
Milos

On Sun, Aug 31, 2025 at 2:19 PM Samuel Thibault <samuel.thiba...@gnu.org> wrote:
> Hello,
>
> Almost there...
>
> Milos Nikic, le ven. 29 août 2025 08:16:26 -0700, a ecrit:
> > From af5cbb943d03c79e1713edcc045ecfa680007fc9 Mon Sep 17 00:00:00 2001
> > From: Milos Nikic <nikic.mi...@gmail.com>
> > Date: Mon, 25 Aug 2025 21:38:08 +0100
> > Subject: [PATCH] libpager, ext2fs: add bulk page write support
> >
> > Introduce a new pager_write_pages() entry point, allowing filesystems
> > to implement multi-page writes in one call. libpager will attempt to
> > coalesce contiguous dirty pages and call this bulk interface; if it is
> > unsupported or only partially completes, the remaining pages are
> > handled by the existing per-page path.
> >
> > ext2fs now provides file_pager_write_pages(), implemented using a
> > small chunked write loop (currently 2 blocks per chunk) under
> > alloc_lock. This reduces lock/unlock cycles, improves batching of
> > store_write() calls, and avoids starvation by yielding between chunks.
> >
> > In testing on a 128 MiB file, average store_write size improved from
> > ~4 KiB to ~8 KiB, cutting device write calls roughly in half.
>
> You want to update the figures :)
>
> > System
> > stress tests all pass without stalls or data corruption.
> >
> > Other filesystems may continue using pager_write_page() unchanged.
> > ---
> >  ext2fs/pager.c         | 128 ++++++++++++++++++++++++++++++++++++++++-
> >  libpager/Makefile      |   2 +-
> >  libpager/data-return.c |  57 +++++++++++++++---
> >  libpager/pager-bulk.c  |  37 ++++++++++++
> >  libpager/pager.h       |  10 ++++
> >  5 files changed, 224 insertions(+), 10 deletions(-)
> >  create mode 100644 libpager/pager-bulk.c
> >
> > diff --git a/ext2fs/pager.c b/ext2fs/pager.c
> > index c55107a9..a47fb381 100644
> > --- a/ext2fs/pager.c
> > +++ b/ext2fs/pager.c
> > @@ -83,11 +83,16 @@ do { pthread_spin_lock (&ext2s_pager_stats.lock); \
> >  #define STAT_INC(field) /* nop */0
> >  #endif /* STATS */
> >
> > +#ifndef EXT2_BULK_MAX_BLOCKS
> > +#define EXT2_BULK_MAX_BLOCKS 128
> > +#endif /* EXT2_BULK_MAX_BLOCKS */
> > +
> >  static void
> >  disk_cache_info_free_push (struct disk_cache_info *p);
> >
> >  #define FREE_PAGE_BUFS 24
> >
> > +
> >  /* Returns a single page page-aligned buffer. */
> >  static void *
> >  get_page_buf (void)
> > @@ -336,7 +341,6 @@ pending_blocks_write (struct pending_blocks *pb)
> >        return err;
> >      else if (amount != length)
> >        return EIO;
> > -
> >      pb->offs += length;
> >      pb->num = 0;
> >    }
>
> Please avoid unrelated whitespace changes.
>
> > @@ -378,6 +382,128 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
> >    pb->num++;
> >    return 0;
> >  }
> > +
> > +/* Bulk write across [offset, offset + length). We coalesce strictly
> > +   consecutive blocks into a single device write. The run ends on the
> > +   first non-consecutive block or when we hit the configured cap.
> > +   We report partial progress via *written (page-aligned). */
> > +static error_t
> > +file_pager_write_pages (struct node *node,
> > +                        vm_offset_t offset,
> > +                        vm_address_t buf,
> > +                        vm_size_t length,
> > +                        vm_size_t *written)
> > +{
> > +  error_t err = 0;
> > +  vm_size_t done = 0; /* bytes successfully issued to the store */
> > +  pthread_rwlock_t *lock = &diskfs_node_disknode (node)->alloc_lock;
> > +  const unsigned max_blocks = EXT2_BULK_MAX_BLOCKS;
> > +
> > +  if (written)
> > +    *written = 0;
> > +
> > +  while (done < length)
> > +    {
> > +      /* Acquire the same lock the per-page path uses. We keep it
> > +         for the short time we enumerate a *single* run and flush it. */
> > +      pthread_rwlock_rdlock (lock);
> > +
> > +      /* Recompute clipping against allocsize each iteration. */
> > +      vm_size_t left = length - done;
> > +      if (offset + done >= node->allocsize)
> > +        {
> > +          pthread_rwlock_unlock (lock);
> > +          break;
> > +        }
> > +      if (offset + done + left > node->allocsize)
> > +        left = node->allocsize - (offset + done);
> > +      if (left == 0)
> > +        {
> > +          pthread_rwlock_unlock (lock);
> > +          break;
> > +        }
> > +
> > +      /* Build one run of strictly consecutive on-disk blocks. */
> > +      struct pending_blocks pb;
> > +      vm_size_t built = 0;
> > +      vm_size_t blocks_built = 0;
> > +      block_t prev = 0, blk = 0;
> > +      int have_prev = 0;
>
> have_prev is actually duplicate with prev, since block numbers are never
> to be 0 you can directly test prev != 0
>
> > +
> > +      pending_blocks_init (&pb, (void *) (buf + done));
> > +
> > +      while (blocks_built < max_blocks && built < left)
> > +        {
> > +          error_t ferr = find_block (node, offset + done + built, &blk, &lock);
> > +          if (ferr)
> > +            {
> > +              err = ferr;
> > +              break;
> > +            }
> > +
> > +          assert_backtrace (blk);
> > +
> > +          if (have_prev && blk != prev + 1)
> > +            break; /* non-consecutive block end of this coalesced run */
>
> But in this case blk is already allocated, so we are leaking it.
>
> I guess simplest might be to have the blk variable kept across the main
> while loop, and you can put a if (!blk) before the find_block call, and
> blk = 0 at the end of the inside while loop. You can also put an
> assert (blk == 0); at the end of the function to make sure that we
> didn't miss a leak.
>
> > +
> > +          /* Extend the run by one block. */
> > +          ferr = pending_blocks_add (&pb, blk);
> > +          if (ferr)
> > +            {
> > +              err = ferr;
> > +              break;
> > +            }
> > +
> > +          prev = blk;
> > +          have_prev = 1;
> > +          built += block_size;
> > +          blocks_built++;
> > +        }
> > +
> > +      /* Flush exactly one coalesced run; even if the loop broke early,
> > +         we may have a valid prefix to push. */
> > +      error_t werr = pending_blocks_write (&pb);
> > +      if (!err)
> > +        err = werr;
> > +
> > +      pthread_rwlock_unlock (lock);
> > +
> > +      /* Advance only by what we actually enumerated and flushed. */
> > +      done += built;
> > +
> > +      /* Stop on error or if we could not make any progress. */
> > +      if (err || built == 0)
>
> When might we make no progress without errors?
>
> Samuel
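In case it is easier to review in isolation, here is a tiny standalone sketch
of the peeked-block pattern from item 2 above. It is not part of the patch and
none of the names below are Hurd code (the block table, the run cap and the
helper names are made up); it just shows that the block which ends one run is
carried over as the first block of the next run, so nothing we looked up is
ever dropped:

/* Standalone sketch of the peeked-block pattern; illustrative only,
   not Hurd code: the block numbers and the run cap are made up.  */

#include <stdio.h>
#include <assert.h>

#define NBLOCKS 10
#define MAX_RUN 4

/* Pretend on-disk block numbers for 10 consecutive file blocks;
   note the discontinuities after 102 and after 203.  */
static const unsigned blocks[NBLOCKS] =
  { 100, 101, 102, 200, 201, 202, 203, 500, 501, 502 };

int
main (void)
{
  unsigned peek = 0;   /* block carried over from the previous run, 0 = none */
  int i = 0;

  while (i < NBLOCKS)
    {
      unsigned first = 0, prev = 0;
      int run_len = 0;

      while (run_len < MAX_RUN && i + run_len < NBLOCKS)
        {
          /* Reuse the carried-over block instead of looking it up again.  */
          unsigned blk = peek ? peek : blocks[i + run_len];
          peek = 0;

          if (prev != 0 && blk != prev + 1)
            {
              /* Non-consecutive: keep it for the next run instead of
                 dropping it.  */
              peek = blk;
              break;
            }

          if (run_len == 0)
            first = blk;
          prev = blk;
          run_len++;
        }

      printf ("flush run: blocks %u..%u (%d blocks)\n", first, prev, run_len);
      i += run_len;
    }

  assert (peek == 0);  /* nothing left dangling on exit */
  return 0;
}

The real function does the same dance with find_block()/pending_blocks_*, plus
the error handling, alloc_lock, and the page-aligned progress reported through
*written.
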
From 6284f5f714c8a5bd0d572d1833108b2b0d1c9405 Mon Sep 17 00:00:00 2001
From: Milos Nikic <nikic.mi...@gmail.com>
Date: Mon, 25 Aug 2025 21:38:08 +0100
Subject: [PATCH] libpager, ext2fs: add bulk page write support

Introduce a new pager_write_pages() entry point, allowing filesystems
to implement multi-page writes in one call. libpager will attempt to
coalesce contiguous dirty pages and call this bulk interface; if it is
unsupported or only partially completes, the remaining pages are
handled by the existing per-page path.

ext2fs now provides file_pager_write_pages(), implemented as a chunked
write loop (up to EXT2_BULK_MAX_BLOCKS, currently 128, blocks per
coalesced run) under alloc_lock. This reduces lock/unlock cycles,
improves batching of store_write() calls, and avoids starving other
users of alloc_lock by releasing it between runs.

Other filesystems may continue using pager_write_page() unchanged.

Test: writing back a 128 MiB dirty file.
Before this change:
  store_write calls: 32,874; avg: 4,096 B
After this change (coalesced runs, cap=128):
  store_write calls: 1,300; avg: 103 KiB
(~25 times fewer calls, ~25 times larger writes)
---
 ext2fs/pager.c         | 147 +++++++++++++++++++++++++++++++++++++++++
 libpager/Makefile      |   2 +-
 libpager/data-return.c |  53 +++++++++++++--
 libpager/pager-bulk.c  |  37 +++++++++++
 libpager/pager.h       |  10 +++
 5 files changed, 241 insertions(+), 8 deletions(-)
 create mode 100644 libpager/pager-bulk.c

diff --git a/ext2fs/pager.c b/ext2fs/pager.c
index c55107a9..a7801bea 100644
--- a/ext2fs/pager.c
+++ b/ext2fs/pager.c
@@ -83,6 +83,10 @@ do { pthread_spin_lock (&ext2s_pager_stats.lock); \
 #define STAT_INC(field) /* nop */0
 #endif /* STATS */
 
+#ifndef EXT2_BULK_MAX_BLOCKS
+#define EXT2_BULK_MAX_BLOCKS 128
+#endif /* EXT2_BULK_MAX_BLOCKS */
+
 static void
 disk_cache_info_free_push (struct disk_cache_info *p);
 
@@ -378,6 +382,149 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
   pb->num++;
   return 0;
 }
+
+/* Bulk write across [offset, offset + length). We coalesce strictly
+   consecutive blocks into a single device write. The run ends on the
+   first non-consecutive block or when we hit the configured cap.
+   We report partial progress via *written (page-aligned). */
+static error_t
+file_pager_write_pages (struct node *node,
+                        vm_offset_t offset,
+                        vm_address_t buf,
+                        vm_size_t length,
+                        vm_size_t *written)
+{
+  error_t err = 0;
+  vm_size_t done = 0; /* bytes successfully issued to the store */
+  pthread_rwlock_t *lock = &diskfs_node_disknode (node)->alloc_lock;
+  const unsigned max_blocks = EXT2_BULK_MAX_BLOCKS;
+
+  /* Persisted lookahead block across runs: if non-zero, it is the first
+     block of the next coalesced run and we won't re-find it. */
+  block_t blk_peek = 0;
+
+  if (written)
+    *written = 0;
+
+  while (done < length)
+    {
+      pthread_rwlock_rdlock (lock);
+
+      /* Recompute clipping against allocsize each iteration. */
+      vm_size_t left = length - done;
+      if (offset + done >= node->allocsize)
+        {
+          pthread_rwlock_unlock (lock);
+          break;
+        }
+      if (offset + done + left > node->allocsize)
+        left = node->allocsize - (offset + done);
+      if (left == 0)
+        {
+          pthread_rwlock_unlock (lock);
+          break;
+        }
+
+      /* Build one run of strictly consecutive on-disk blocks. */
+      struct pending_blocks pb;
+      vm_size_t built = 0;
+      vm_size_t blocks_built = 0;
+      block_t prev = 0;
+      block_t blk = 0;
+
+      pending_blocks_init (&pb, (void *) (buf + done));
+
+      while (blocks_built < max_blocks && built < left)
+        {
+          /* Use peeked block if we have one; otherwise look it up. */
+          if (blk_peek)
+            {
+              blk = blk_peek;
+            }
+          else
+            {
+              error_t ferr =
+                find_block (node, offset + done + built, &blk, &lock);
+              if (ferr)
+                {
+                  err = ferr;
+                  break;
+                }
+            }
+
+          assert_backtrace (blk);
+
+          if (prev != 0 && blk != prev + 1)
+            {
+              /* Non-consecutive: keep BLK for the next outer run. */
+              blk_peek = blk;
+              break;
+            }
+
+          /* Extend the run by one block. We are consuming BLK now. */
+          error_t ferr = pending_blocks_add (&pb, blk);
+          if (ferr)
+            {
+              err = ferr;
+              break;
+            }
+
+          prev = blk;
+          built += block_size;
+          blocks_built++;
+
+          /* We consumed the peeked block; clear it so we fetch the next. */
+          blk_peek = 0;
+        }
+
+      /* Flush exactly one coalesced run; even if the loop broke early,
+         we may have a valid prefix to push. */
+      error_t werr = pending_blocks_write (&pb);
+      if (!err)
+        err = werr;
+
+      pthread_rwlock_unlock (lock);
+
+      /* Advance only by what we actually enumerated and flushed. */
+      done += built;
+
+      /* Stop on error. */
+      if (err)
+        break;
+    }
+
+  /* We must not be "holding" a peeked block on return unless we stopped
+     due to an error. */
+  assert_backtrace (blk_peek == 0 || err);
+
+  if (written)
+    {
+      vm_size_t w = done;
+      if (w > length)
+        w = length;
+      /* libpager expects progress aligned to whole pages. */
+      w -= (w % vm_page_size);
+      *written = w;
+    }
+
+  return err;
+}
+
+/* Strong override: only FILE_DATA uses bulk; others keep per-page path. */
+error_t
+pager_write_pages (struct user_pager_info *pager,
+                   vm_offset_t offset,
+                   vm_address_t data,
+                   vm_size_t length,
+                   vm_size_t *written)
+{
+  /* libpager will just hand this off to the pager_write_page. */
+  if (pager->type != FILE_DATA)
+    return EOPNOTSUPP;
+
+  return file_pager_write_pages (pager->node, offset, data, length, written);
+}
+
 
 /* Write one page for the pager backing NODE, at OFFSET, into BUF.  This
    may need to write several filesystem blocks to satisfy one page, and tries
diff --git a/libpager/Makefile b/libpager/Makefile
index 06fcb96b..169d5ab1 100644
--- a/libpager/Makefile
+++ b/libpager/Makefile
@@ -24,7 +24,7 @@ SRCS = data-request.c data-return.c data-unlock.c pager-port.c \
 	pager-create.c pager-flush.c pager-shutdown.c pager-sync.c \
 	stubs.c demuxer.c chg-compl.c pager-attr.c clean.c \
 	dropweak.c get-upi.c pager-memcpy.c pager-return.c \
-	offer-page.c pager-ro-port.c
+	offer-page.c pager-ro-port.c pager-bulk.c
 installhdrs = pager.h
 
 HURDLIBS= ports
diff --git a/libpager/data-return.c b/libpager/data-return.c
index a69a2c5c..1416ae41 100644
--- a/libpager/data-return.c
+++ b/libpager/data-return.c
@@ -158,14 +158,53 @@ _pager_do_write_request (struct pager *p,
   /* Let someone else in. */
   pthread_mutex_unlock (&p->interlock);
 
-  /* This is inefficient; we should send all the pages to the device at once
-     but until the pager library interface is changed, this will have to do. */
+  int i_page = 0;
+  while (i_page < npages)
+    {
+      if (omitdata & (1U << i_page))
+	{
+	  i_page++;
+	  continue;
+	}
+
+      /* Find maximal contiguous run [i_page, j_page) with no omitdata. */
+      int j_page = i_page + 1;
+      while (j_page < npages && ! (omitdata & (1U << j_page)))
+	j_page++;
+
+      vm_offset_t run_off = offset + (vm_page_size * i_page);
+      vm_address_t run_ptr = data + (vm_page_size * i_page);
+      vm_size_t run_len = vm_page_size * (j_page - i_page);
+
+      vm_size_t wrote = 0;
+
+      /* Attempt bulk write. */
+      error_t berr = pager_write_pages (p->upi, run_off, run_ptr,
+					run_len, &wrote);
+
+      /* How many pages did bulk actually complete? (only if not EOPNOTSUPP) */
+      int pages_done = 0;
+      if (berr != EOPNOTSUPP)
+	{
+	  if (wrote > run_len)
+	    wrote = run_len;
+	  wrote -= (wrote % vm_page_size);
+	  pages_done = wrote / vm_page_size;
+	}
+
+      /* Mark successful prefix (if any). */
+      for (int k = 0; k < pages_done; k++)
+	pagerrs[i_page + k] = 0;
+
+      /* Per-page the remaining suffix of the run, or the whole run if unsupported. */
+      for (int k = i_page + pages_done; k < j_page; k++)
+	pagerrs[k] = pager_write_page (p->upi,
+				       offset + (vm_page_size * k),
+				       data + (vm_page_size * k));
+
+      i_page = j_page;
+    }
 
-  for (i = 0; i < npages; i++)
-    if (!(omitdata & (1U << i)))
-      pagerrs[i] = pager_write_page (p->upi,
-				     offset + (vm_page_size * i),
-				     data + (vm_page_size * i));
 
   /* Acquire the right to meddle with the pagemap */
   pthread_mutex_lock (&p->interlock);
diff --git a/libpager/pager-bulk.c b/libpager/pager-bulk.c
new file mode 100644
index 00000000..fcceeb43
--- /dev/null
+++ b/libpager/pager-bulk.c
@@ -0,0 +1,37 @@
+/* pager-bulk.c  Default (dummy) implementation of bulk page write.
+
+Copyright (C) 2025 Free Software Foundation, Inc.
+Written by Milos Nikic.
+
+This file is part of the GNU Hurd.
+
+The GNU Hurd is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+The GNU Hurd is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with the GNU Hurd; if not, see <https://www.gnu.org/licenses/>.  */
+
+#include <libpager/pager.h>
+#include "priv.h"
+
+/* Default dummy implementation of pager_write_pages. */
+__attribute__((weak)) error_t
+pager_write_pages (struct user_pager_info *upi,
+		   vm_offset_t offset,
+		   vm_address_t data, vm_size_t length, vm_size_t *written)
+{
+  (void) upi;
+  (void) offset;
+  (void) data;
+  (void) length;
+  if (written)
+    *written = 0;
+  return EOPNOTSUPP;
+}
diff --git a/libpager/pager.h b/libpager/pager.h
index 3b1c7251..8c43ad0e 100644
--- a/libpager/pager.h
+++ b/libpager/pager.h
@@ -203,6 +203,16 @@ pager_write_page (struct user_pager_info *pager,
 		  vm_offset_t page,
 		  vm_address_t buf);
 
+/* The user may define this function. For pager PAGER, synchronously
+   write potentially multiple pages from DATA to offset.
+   Do not deallocate DATA, and do not keep any references to DATA.
+   The only permissible error returns are EIO, EDQUOT, EOPNOTSUPP, and ENOSPC. */
+error_t pager_write_pages(struct user_pager_info *upi,
+			  vm_offset_t offset,
+			  vm_address_t data,
+			  vm_size_t length,
+			  vm_size_t *written);
+
 /* The user must define this function. A page should be made writable. */
 error_t
 pager_unlock_page (struct user_pager_info *pager,
--
2.50.1
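
P.S. If it helps while reviewing the libpager side, here is a small
self-contained mock of the new run-splitting in data-return.c. Again, this is
not part of the patch; mock_write_pages()/mock_write_page() and the constants
are stand-ins I made up, with the bulk hook pretending it can only complete
the first two pages of each run so that the per-page fallback is visible:

/* Mock of the bulk-then-per-page fallback; illustrative only, not Hurd code. */

#include <stdio.h>
#include <errno.h>

#define PAGE 4096
#define NPAGES 8

/* Stand-in for the filesystem bulk hook: pretend it only manages the
   first two pages of any run, like a partial bulk write.  */
static int
mock_write_pages (unsigned off, unsigned len, unsigned *written)
{
  *written = len > 2 * PAGE ? 2 * PAGE : len;
  printf ("  bulk write: off=%u len=%u -> wrote=%u\n", off, len, *written);
  return 0;
}

/* Stand-in for the existing per-page path.  */
static int
mock_write_page (unsigned off)
{
  printf ("  page write: off=%u\n", off);
  return 0;
}

int
main (void)
{
  /* Bit i set means page i is clean, like omitdata.  */
  unsigned omit = (1U << 3) | (1U << 4);
  int i = 0;

  while (i < NPAGES)
    {
      if (omit & (1U << i))
        {
          i++;
          continue;
        }

      /* Maximal contiguous run of dirty pages [i, j).  */
      int j = i + 1;
      while (j < NPAGES && !(omit & (1U << j)))
        j++;

      unsigned wrote = 0;
      int err = mock_write_pages (i * PAGE, (unsigned) (j - i) * PAGE, &wrote);
      int done = (err == EOPNOTSUPP) ? 0 : (int) (wrote / PAGE);

      /* Whatever bulk did not cover goes through the per-page path.  */
      for (int k = i + done; k < j; k++)
        mock_write_page (k * PAGE);

      i = j;
    }
  return 0;
}

Running it shows each contiguous dirty run being offered to the bulk hook
first, with the untouched suffix (and the whole run after an EOPNOTSUPP)
going through the per-page path, which is the behaviour of the new loop in
_pager_do_write_request().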