Hey Samuel,

Thanks for the thorough review. I am working on a V2 patch to incorporate all the feedback.
In the meantime, I would like you to review my mental model of how things work in the Hurd, because it is what I based most of my decisions on for the journaling code. I built this mental model by reading the code, experimenting, and observing the system at run time. Here it is. (Below I will refer to libdiskfs/ext2 as "the VFS" for simplicity.)

The Hurd VFS works in three layers:

1. *Node cache layer:* The abstract node lives here, and it is the ground truth of a running file system. When one stats myfile.txt, the information comes straight from this cache. When we create a new file, it gets placed in the cache, and so on.

2. *Pager layer:* This is where nodes are serialized into the actual physical representation (4KB blocks) that will later be written to disk.

3. *Hard drive:* The physical storage that receives the bytes from the pager.

During normal operation (i.e. not a sync mount, fsync, etc.), the VFS operates almost entirely on layer 1, the node cache layer. This is why it is so fast. User changed atime? No problem: we fetch the node from the node cache (a hash table lookup, amortized O(1)) and update the struct in memory. That is it.

Only when the sync interval hits (every 30 seconds by default) does the node cache get iterated and serialized to the pager layer (diskfs_sync_everything -> write_all_disknodes -> write_node -> pager_sync). At that moment we effectively take a snapshot of the node cache state and hand it to the pager(s). Even then, pager_sync is called with wait = 0: the request is handed to the pager, which sends it to Mach. At some later time (on the order of seconds), Mach sends the pages back to the ext2 pager, which finally issues store_write to write them to layer 3, the hard drive. And even that depends on how the driver reorders or delays the request.

The effect of this architecture is that when store_write is finally called, *the latest version of the node cache snapshot is what gets written to storage.*

Is this basically correct?
Are there any edge cases, or mechanics I have wrong in this model, that could cause a "stale" node cache snapshot to be written?

Regards,
Milos

On Tue, Mar 3, 2026 at 4:31 PM Samuel Thibault <[email protected]> wrote:
> Hello,
>
> Milos Nikic, le jeu. 26 févr. 2026 16:55:41 -0800, a ecrit:
> > For instance when one is compiling Hurd.
> > I have experienced it about 8.5% slower compilation than when journal is
> not
> > present. (and ~12% slower when running ./configure for instance),
>
> I'm a bit surprised by the overhead. Notably for a compilation that does
> not write many files, just quite some data (but data is not journaled).
>
> Is that because we write e.g. directories atime more often?
>
> > but 0% slower when mass deleting files
>
> And that one I'm a bit more surprised since that's where metadata would
> be put under pressure.
>
> > Its a large patch now...let me know if you want it split up.
>
> It's fine as it is: all the plugging in libdiskfs makes sense along the
> jb implementation
>
> > diff --git a/ext2fs/dir.c b/ext2fs/dir.c
> > index 256f3d36..fd78cd7c 100644
> > --- a/ext2fs/dir.c
> > +++ b/ext2fs/dir.c
> > @@ -241,8 +241,7 @@ diskfs_lookup_hard (struct node *dp, const char
> *name, enum
> > lookup_type type,
> > }
> >
> > diskfs_set_node_atime (dp);
> > - if (diskfs_synchronous)
> > - diskfs_node_update (dp, 1);
> > + diskfs_node_update (dp, diskfs_synchronous);
> >
> > /* If err is set here, it's ENOENT, and we don't want to
> > think about that as an error yet. */
>
> Do we really want to update the node? Perhaps that's what the overhead
> comes from? With journaling, we don't want to write more often. We just
> want to write to the journal before writing to the real place, to make
> sure to keep coherency, only.
>
> And similarly in other diskfs_node_update places.
>
> > -/* Sync all allocation information and node NP if diskfs_synchronous. */
> > +/* Sync all allocation information and node NP if diskfs_synchronous.
> > + If journaling is active, we just update memory (wait=0) and let the
> > + transaction commit handle durability. */
> > EXT2FS_EI void
> > alloc_sync (struct node *np)
> > {
> > - if (diskfs_synchronous)
> > + if (np)
> > {
> > - if (np)
> > - {
> > - diskfs_node_update (np, 1);
> > - pokel_sync (&diskfs_node_disknode (np)->indir_pokel, 1);
> > - }
> > - diskfs_set_hypermetadata (1, 0);
> > + diskfs_node_update (np, diskfs_synchronous);
> > + pokel_sync (&diskfs_node_disknode (np)->indir_pokel,
> diskfs_synchronous);
>
> Again, do we really want to trigger the poke?
>
> > }
> > + diskfs_set_hypermetadata (diskfs_synchronous, 0);
>
> (and the related hypermetadata update)
>
> > }
> > #endif /* Use extern inlines. */
> >
> >
> > diff --git a/ext2fs/inode.c b/ext2fs/inode.c
> > index 8d10af01..b22c1c4f 100644
> > /* Sync the info in NP->dn_stat and any associated format-specific
> > information to disk. If WAIT is true, then return only after the
> > physicial media has been completely updated. */
> > void
> > diskfs_write_disknode (struct node *np, int wait)
> > {
> > - struct ext2_inode *di = write_node (np);
> > + struct ext2_inode *di;
> > +
> > + if (ext2_journal)
> > + {
> > + di = write_disknode_journaled (np, wait);
>
> You can as well just include the content of write_disknode_journaled
> here, it seems to me you will be able to factorize the write_node call
> etc. and it'll be simpler in the end.
>
> Also, we are ignoring the wait flag and just always wait? Again, I don't
> see why we should be doing this. Journaling is not supposed to make
> programs wait more, but just make sure that what is on the disk is
> coherent.
>
> > + if (di)
> > + record_global_poke (di);
> > + return;
> > + }
> > +
> > + di = write_node (np);
> > if (di)
> > {
> > if (wait)
> > {
> > - sync_global_ptr (di, 1);
> > + sync_global_ptr (di, 1);
> > error_t err = store_sync (store);
> > + /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
> > if (err && err != EOPNOTSUPP)
> > - ext2_warning ("inode flush failed: %s", strerror (err));
> > + ext2_warning ("device flush failed: %s", strerror (err));
> > }
> > else
> > - record_global_poke (di);
> > + {
> > + record_global_poke (di);
> > + }
> > }
> > }
> >
> > +/**
> > + * Called by the Pager (store_write hook) after writing blocks to the
> main disk.
> > + * Bulk version to handle clustered pageouts efficiently.
> > + */
> > +void
> > +journal_notify_blocks_written (block_t start_block, size_t n_blocks)
>
> Can it happen that we write several time to the same metadata block, and
> thus have several transactions in the log for the same block, and then
> we don't know if the pager write notification is only for the former or
> both? and all kinds of mixtures.
>
> This should be ordered in the journal.c file after the commiting
> function, since the pager write is supposed to happen after we have
> committed to the journal.
>
> > +/* Writes the Descriptor Block + All Data Blocks (Escaped) */
> > +static error_t
> > +journal_write_payload (journal_t *journal, const diskfs_transaction_t
> *txn)
> > +/* Writes the Commit Block */
> > +static error_t
> > +journal_write_commit_record (journal_t *journal,
> > + diskfs_transaction_t *txn, uint32_t
> commit_loc)
> > +/* Cleans up the transaction. */
> > +static error_t
> > +journal_cleanup_transaction (diskfs_transaction_t *txn, error_t err)
>
> These should be ordered in journal.c along
> journal_commit_running_transaction_locked, it's their only caller.
>
> > +/**
> > + * Ensures there is a VALID running transaction to attach to.
> > + * Returns 0 on success, or error code.
> > + */
> > +static error_t
> > +journal_start_transaction (diskfs_transaction_t **out_txn)
> > +{
> > + diskfs_transaction_t *txn;
> > +
> > + if (!ext2_journal)
> > + return EINVAL;
> > +
> > + JOURNAL_LOCK (ext2_journal);
> > +
> > + if (ext2_journal->j_free < ext2_journal->j_min_free)
> > + {
> > + JRNL_LOG_DEBUG
> > + ("[TRX] Journal full (Free: %u). Forcing checkpoint.",
> > + ext2_journal->j_free);
> > +
> > + journal_force_checkpoint_locked (ext2_journal);
> > + }
> > +
> > + txn = ext2_journal->j_running_transaction;
> > +
> > + if (txn)
> > + {
> > + assert_backtrace (txn->t_state == T_RUNNING);
> > + txn->t_updates++;
> > + }
> > + else
> > + {
> > + txn = calloc (1, sizeof (diskfs_transaction_t));
> > + if (!txn)
> > + {
> > + JOURNAL_UNLOCK (ext2_journal);
> > + return ENOMEM;
> > + }
> > +
> > + hurd_ihash_init (&txn->t_buffer_map, HURD_IHASH_NO_LOCP);
> > + txn->t_tid = ext2_journal->j_transaction_sequence++;
> > + txn->t_state = T_RUNNING;
> > + txn->t_updates = 1;
> > +
> > + ext2_journal->j_running_transaction = txn;
> > + JRNL_LOG_DEBUG ("[TRX] Created NEW TID %u", txn->t_tid);
> > + }
> > +
> > + JOURNAL_UNLOCK (ext2_journal);
> > + *out_txn = txn;
> > + return 0;
> > +}
> > +
> > +diskfs_transaction_t *
> > +diskfs_journal_start_transaction (void)
> > +{
> > + if (ext2_journal)
> > + {
> > + diskfs_transaction_t *real_txn;
> > + error_t err = journal_start_transaction (&real_txn);
>
> Is it really useful to separate the two? You can just return early from
> diskfs_journal_start_transaction when !ext2_journal.
>
> > + if (err)
> > + return NULL;
> > + return real_txn;
> > + }
> > + return NULL;
> > +}
>
> > +/**
> > + * Called by the pager BEFORE writing blocks to their permanent home.
> > + * Tries its best to ensure WAL ordering for a range of blocks.
> > + * There are cases in which that is not feasible though! (see below)
> > + */
> > +void
> > +journal_ensure_blocks_journaled (block_t start_block, size_t n_blocks)
> > +{
> > + if (!ext2_journal || n_blocks == 0)
> > + return;
> > +
> > + JOURNAL_LOCK (ext2_journal);
> > +
> > + diskfs_transaction_t *commit = ext2_journal->j_committing_transaction;
> > + diskfs_transaction_t *run = ext2_journal->j_running_transaction;
> > + uint32_t wait_tid = 0;
> > + int force_commit = 0;
> > + int in_committing = 0;
> > +
> > + for (size_t i = 0; i < n_blocks; i++)
> > + {
> > + block_t b = start_block + i;
> > +
> > + /* If ANY block is in the running transaction, we must force a
> commit. */
> > + if (run && hurd_ihash_find (&run->t_buffer_map,
> (hurd_ihash_key_t) b))
> > + {
> > + force_commit = 1;
> > + wait_tid = run->t_tid;
> > + break;
> > + }
> > + /* Keep checking in case a later block is in the RUNNING state. */
> > + else if (commit
> > + && hurd_ihash_find (&commit->t_buffer_map,
> > + (hurd_ihash_key_t) b))
> > + {
> > + if (wait_tid == 0)
> > + wait_tid = commit->t_tid;
> > + in_committing = 1;
> > + }
> > + }
> > +
> > + /* The libpager MUST NOT sleep waiting for a VFS thread or a journal
> thread.
> > + If we make it sleep, we risk a system-wide deadlock.
> > + On the other hand, if we return EAGAIN or similar, the pager
> > + will just drop the request and the block writes will be lost.
> > + Therefore, if the journal is not instantly ready to commit, we
> MUST bypass
> > + the WAL barrier and write directly to the main disk. */
> > + if (force_commit)
> > + {
> > + /* If VFS is mutating the transaction, or the disk pipeline is
> already full,
> > + we cannot commit right now. Bail out and bypass! */
> > + if (run->t_updates > 0 || commit != NULL)
> > + {
> > + JRNL_LOG_DEBUG
> > + ("[WARN] VM Deadlock Hazard! Bypassing WAL for RUNNING TID %u",
> > + wait_tid);
>
> We want more than a debug log, as this is a case were we are failing to
> provide the guarantee.
>
> > + JOURNAL_UNLOCK (ext2_journal);
> > + return;
> > + }
> > +
> > + /* It is perfectly safe. The Pager will drive the commit
> synchronously right now. */
> > + JRNL_LOG_DEBUG ("Pager forcing synchronous commit for TID %u",
> > + wait_tid);
> > + journal_commit_running_transaction_locked (ext2_journal);
> > + return;
> > + }
> > + else if (in_committing)
> > + {
> > + /* The block is actively being written to the log by another
> thread.
> > + We cannot wait for it. Bail out and bypass! */
> > + JRNL_LOG_DEBUG
> > + ("[WARN] VM Deadlock Hazard! Bypassing WAL for COMMITTING TID %u",
>
> Ditto.
>
> I'm wondering: can it not happen "often" that we have a pager write
> while the journald thread is writing? Precisely since the sync and pager
> threads are not synchronized. We could rather synchronize them, making
> sure to write the journal before synchronizing the pager.
>
> > + wait_tid);
> > + }
> > +
> > + JOURNAL_UNLOCK (ext2_journal);
> > +}
>
> > diff --git a/ext2fs/pager.c b/ext2fs/pager.c
> > index 1c795784..8b58ea21 100644
> > --- a/ext2fs/pager.c
> > +++ b/ext2fs/pager.c
> > @@ -25,6 +25,7 @@
> > #include <inttypes.h>
> > #include <hurd/store.h>
> > #include "ext2fs.h"
> > +#include "journal.h"
> >
> > /* XXX */
> > #include "../libpager/priv.h"
> > @@ -326,6 +327,9 @@ pending_blocks_write (struct pending_blocks *pb)
> >
> > ext2_debug ("writing block %u[%ld]", pb->block, pb->num);
> >
> > + /* Lets make sure these are all already committed. */
> > + journal_ensure_blocks_journaled (pb->block, pb->num);
> > +
> > if (pb->offs > 0)
> > /* Put what we're going to write into a page-aligned buffer. */
> > {
> > @@ -336,6 +340,12 @@
> > }
> > else
> > err = store_write (store, dev_block, pb->buf, length, &amount);
> > +
> > + /* Now tell the journal about the partial/full success. */
> > + size_t written_blocks = amount >> log2_block_size;
> > + journal_notify_blocks_written (pb->block, written_blocks);
> > +
> > +
> > if (err)
> > return err;
> > else if (amount != length)
> > @@ -477,14 +487,13 @@ file_pager_write_pages (struct node *node,
> > blk_peek = 0;
> > }
> >
> > + pthread_rwlock_unlock (lock);
> > /* Flush exactly one coalesced run; even if the loop broke early,
> > we may have a valid prefix to push. */
> > error_t werr = pending_blocks_write (&pb);
> > if (!err)
> > err = werr;
> >
> > - pthread_rwlock_unlock (lock);
> > -
>
> This looks like a bold unlock move?
>
> > /* Advance only by what we actually enumerated and flushed. */
> > done += built;
> >
> > @@ -566,11 +575,11 @@ file_pager_write_page (struct node *node,
> vm_offset_t offset, void *buf)
> > left -= block_size;
> > }
> >
> > + if (lock)
> > + pthread_rwlock_unlock (lock);
> > if (!err)
> > pending_blocks_write (&pb);
> >
> > - pthread_rwlock_unlock (&diskfs_node_disknode (node)->alloc_lock);
> > -
>
> ditto.
>
> > return err;
> > }
> >
> > @@ -674,8 +683,17 @@ disk_pager_write_page (vm_offset_t page, void *buf)
> > }
> > else
> > {
> > + block_t start_block = offset >> log2_block_size;
> > + size_t n_blocks = length >> log2_block_size;
> > + /* Ensure that we don't have these blocks in the currently
> > + * running or committing transaction. This functiono will block
> > + * if it needs to ensure this holds true. */
> > + journal_ensure_blocks_journaled (start_block, n_blocks);
> > +
> > err = store_write (store, offset >> store->log2_block_size,
> > buf, length, &amount);
> > + size_t written_blocks = amount >> log2_block_size;
> > + journal_notify_blocks_written (start_block, written_blocks);
> > if (!err && length != amount)
> > err = EIO;
> > }
> > @@ -916,9 +934,12 @@ diskfs_file_update (struct node *node, int wait)
> > ports_port_deref (pager);
> > }
> >
> > - pokel_sync (&diskfs_node_disknode (node)->indir_pokel, wait);
> > + /* If there is a journal present we will not sync metadata immediately
> > + We will let the journal do it when its ready. */
> > + int meta_wait = ext2_journal ? 0 : wait;
> > + pokel_sync (&diskfs_node_disknode (node)->indir_pokel, meta_wait);
> >
> > - diskfs_node_update (node, wait);
> > + diskfs_node_update (node, meta_wait);
> > }
> >
> > /* Invalidate any pager data associated with NODE. */
> > @@ -1569,17 +1590,43 @@ diskfs_shutdown_pager (void)
> > return 0;
> > }
> >
> > + journal_commit_running_transaction ();
> > write_all_disknodes ();
> >
> > ports_bucket_iterate (file_pager_bucket, shutdown_one);
> >
> > /* Sync everything on the the disk pager. */
> > sync_global (1);
> > + journal_quiesce_checkpoints ();
> > store_sync (store);
> > /* Despite the name of this function, we never actually shutdown the
> disk
> > pager, just make sure it's synced. */
> > }
> >
> > +static error_t
> > +journal_sync_one (void *v_p)
> > +{
> > + struct pager *p = v_p;
> > + pager_sync (p, 1);
> > + return 0;
> > +}
> > +
> > +/**
> > + * Sync all the pagers synchronously, but don't call
> > + * journal_commit here. It would deadlock.
> > + **/
> > +void
> > +journal_sync_everything (void)
> > +{
> > + write_all_disknodes ();
> > + ports_bucket_iterate (file_pager_bucket, journal_sync_one);
> > + sync_global (1);
> > + error_t err = store_sync (store);
> > + /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
> > + if (err && err != EOPNOTSUPP)
> > + ext2_warning ("device flush failed: %s", strerror (err));
> > +}
> > +
> > /* Sync all the pagers. */
> > void
> > diskfs_sync_everything (int wait)
> > @@ -1591,6 +1638,9 @@ diskfs_sync_everything (int wait)
> > return 0;
> > }
> >
> > + /* We only commit if there is a journal and we have a running
> transaction */
> > + journal_commit_running_transaction ();
> > +
> > write_all_disknodes ();
> > ports_bucket_iterate (file_pager_bucket, sync_one);
> >
> > @@ -1599,7 +1649,6 @@
> > if (wait)
> > {
> > error_t err = store_sync (store);
> > - /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
> > if (err && err != EOPNOTSUPP)
> > ext2_warning ("device flush failed: %s", strerror (err));
> > }
> > diff --git a/ext2fs/truncate.c b/ext2fs/truncate.c
> > index aa3a5a60..33cd9f7a 100644
> > @@ -19,6 +19,7 @@
> > Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
> >
> > #include "ext2fs.h"
> > +#include "journal.h"
> >
> > #ifdef DONT_CACHE_MEMORY_OBJECTS
> > #define MAY_CACHE 0
> > @@ -330,7 +331,7 @@ diskfs_truncate (struct node *node, off_t length)
> > diskfs_node_rdwr (node, (void *)zeroblock, length, block_size -
> offset,
> > 1, 0, 0);
> > /* Make sure that really happens to avoid leaks. */
> > - diskfs_file_update (node, 1);
> > + diskfs_file_update (node, diskfs_synchronous);
>
> No, we never want to leave information leaks.
>
> > }
> >
> > ext2_discard_prealloc (node);
> > @@ -385,5 +386,6 @@ diskfs_truncate (struct node *node, off_t length)
> >
> > pthread_rwlock_unlock (&diskfs_node_disknode (node)->alloc_lock);
> >
> > + diskfs_node_update (node, diskfs_synchronous);
>
> We already have done one above. It's an interesting case: we really want
> a transaction in this case: either we have done the truncation, or we
> have not, and then we don't have to rely on e2fsck to finish the work.
>
> > @@ -43,24 +45,34 @@ diskfs_init_dir (struct node *dp, struct node *pdp,
> struct protid *cred)
> > dp->dn_stat.st_nlink++; /* for `.' */
> > dp->dn_set_ctime = 1;
> > err = diskfs_lookup (dp, ".", CREATE, &foo, ds, &lookupcred);
> > + diskfs_node_update (dp, diskfs_synchronous);
> > assert_backtrace (err == ENOENT);
> > err = diskfs_direnter (dp, ".", dp, ds, cred);
> > if (err)
> > {
> > dp->dn_stat.st_nlink--;
> > dp->dn_set_ctime = 1;
> > + diskfs_node_update (dp, diskfs_synchronous);
>
> Again, no, we don't want to introduce new synchronization points. We
> just want to sequentialize writing to the journal and actually writing
> the data. And the same in the further places.
>
> Samuel
>
