Hi all,

I wanted to share some thoughts to clarify the design decisions behind the journaling prototype, especially where they differ from ext3/4, which came up in earlier discussion.
*1. Why no “begin transaction” / “end transaction” markers like ext3/4?*

*Short answer*: They’re unnecessary for what we’re doing, and not aligned with our architecture.

*Longer explanation*: ext3/4 journaling operates at the block level, where even a simple operation like changing a file’s mode might touch multiple disk blocks. They use transaction markers to group these low-level updates into an atomic unit. In contrast, our prototype operates at a higher semantic level — above VFS — where each logged action (e.g. “inode 1234 changed mode from 0755 to 0644”) is already atomic. There is no ambiguity or multi-step state to bracket. Adding transaction markers in this context wouldn’t increase correctness or recoverability — it would only waste valuable journal space. (A rough sketch of what a single record could look like follows point 3 below.)

*2. Why not scan the live filesystem at boot to reconstruct paths?*

It’s true that path reconstruction is difficult during logging, since names can change and nodes can move or disappear. But relying on scanning after a crash doesn’t make recovery more reliable — just more confusing. For example:

- If the logged inode (say, 1234) no longer exists at scan time, we’ve learned nothing new. No recovery can happen.
- If it *does* exist (say, /home/user/pictures/pic.jpg), the scan reveals a file that has survived — but we don’t know whether its content is the same. Do we overwrite? Skip? Log a message?

In either case, post-crash scanning isn’t actionable unless the journal itself carries contextual information like paths, which requires capturing them at the time of logging — not at recovery.

*3. Spammy device nodes (/dev/random, /dev/null, etc.)*

These nodes are not just noisy in atime updates — they also frequently change mtime and ctime. Even with relatime, these updates happen often enough to quickly flood the journal. Given that journal space is limited, letting these device nodes monopolize it crowds out actual user-facing metadata changes. Filtering them early is a pragmatic decision that avoids noise and preserves valuable entries (a sketch of such a filter is below as well).
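To make points 1 and 2 a bit more concrete, here is a rough sketch of what a single, self-contained journal record could look like. This is purely illustrative; the field names and layout are assumptions made for the example, not the prototype's actual on-disk format:

/* Hypothetical sketch only: a single, self-contained journal record.
   Field names and layout are illustrative, not the prototype's actual
   format.  */
#include <stdint.h>
#include <sys/types.h>

enum journal_op
  {
    JOURNAL_MODE_CHANGE,        /* e.g. "inode 1234 changed mode from 0755 to 0644" */
    JOURNAL_OWNER_CHANGE,
    JOURNAL_TIMES_CHANGE,
  };

struct journal_record
{
  uint32_t seq;                 /* sequence number, for replay ordering */
  ino_t ino;                    /* inode the change applies to, e.g. 1234 */
  enum journal_op op;           /* which kind of metadata change this is */
  mode_t old_mode, new_mode;    /* meaningful when op == JOURNAL_MODE_CHANGE */
  char path[256];               /* path captured at logging time (point 2);
                                   fixed size only to keep the sketch simple */
};

Because each record is complete on its own, replay can apply it (or decide it is stale) without any begin/end bracketing, and the path is captured when the change is logged rather than reconstructed from a post-crash scan.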
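And for point 3, the early filter can be as simple as a check along these lines (the helper name and where it would be called from are placeholders; the test itself is just the standard sys/stat.h macros):

#include <sys/stat.h>

/* Hypothetical helper: decide whether a metadata change is worth a
   journal entry at all.  Character and block device nodes such as
   /dev/null and /dev/random churn atime/mtime/ctime constantly, so we
   drop them before they can flood the limited journal space.  */
static int
journal_should_log (const struct stat *st)
{
  if (S_ISCHR (st->st_mode) || S_ISBLK (st->st_mode))
    return 0;
  return 1;
}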
Hopefully this clears up the rationale behind some of these design choices. Let me know if you’d like to dive deeper into any of them.

Best,
Milos

On Mon, Jul 21, 2025, 4:04 PM Samuel Thibault <samuel.thiba...@gnu.org> wrote:

> Milos Nikic, le lun. 21 juil. 2025 11:38:00 -0700, a écrit:
> > > Which kind of operations is spamming? As I mentioned, we most probably
> > > want to implement relatime, that'll be useful to avoid many writes
> > > anyway.
> >
> > Mainly `utime` updates to `/dev/null` and `/dev/random`.
>
> Which would be caught by relatime.
>
> «
> Access time is only updated if the previous access time was earlier than
> or equal to the current modify or change time.
> »
>
> Better take the time to implement that, since that'll save the
> corresponding inode writes too.
>
> > > Better use the ext3/4 native way of allocating blocks for the journal.
> >
> > That’s exactly what I’d like to do next — but I’m not sure how to get
> > there in this context. Would this involve allocating blocks outside the
> > main filesystem namespace via libstore? Any pointers or examples would
> > be really appreciated.
>
> No, it's still in the disk storage. It's just that ext3 has a way to
> reserve blocks for the journal. I don't know a reference for this but it
> should be easy to find.
>
> > > Does the normal path lookup not work? At worst by rearranging some code
> > > to provide an internal version not meant for RPCs.
> >
> > That’s the trick: the issue isn’t how, but *when*.
> > The journal contains information from before the crash, but after
> > reboot, we’re walking a post-crash live filesystem. If we try to resolve
> > inode paths at boot, we might end up with mismatches, or restoring paths
> > that no longer make sense.
>
> But the journal is supposed to be in an order that makes sense
> sequentially. Again, better check how ext3/4/jbd are doing it, rather
> than trying to re-invent them.
>
> > One additional note: while testing I have discovered that the filesystem
> > remains read-only at that early point, and it only stops being read-only
> > after the RPCs come online.
> > If I just call diskfs_node_update that early (as I do in the patch), it
> > silently has no effect (!!!)
>
> You probably just want to set diskfs_readonly = 0 while playing the
> journal, and reset it to what it was (as asked on the command-line etc.)
> just before unleashing RPCs.
>
> > On the other hand, once RPCs are up, trying to walk the FS to replay
> > changes risks deadlocks.
>
> Sure, you don't want that.
>
> > It feels like journaling recovery needs to happen in a carefully
> > coordinated phase — perhaps a new pre-init mode, or deeper integration
> > with `diskfs` itself.
>
> Yes. Feel free to add hooks if libdiskfs doesn't have what you need.
>
> Samuel
>
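One follow-up on the diskfs_readonly suggestion above, mostly to check that I understood it correctly: a minimal sketch of the replay phase, with journal_replay standing in for the prototype's actual replay routine, would be something like:

/* Minimal sketch of the suggestion quoted above.  journal_replay () is a
   placeholder for the prototype's replay routine; the point is only the
   save/clear/restore of diskfs_readonly around it, before RPCs are
   unleashed.  */
#include <hurd/diskfs.h>

extern void journal_replay (void);   /* placeholder, not an existing libdiskfs call */

static void
journal_recover (void)
{
  int saved_readonly = diskfs_readonly;

  diskfs_readonly = 0;                /* allow metadata writes during replay */
  journal_replay ();
  diskfs_readonly = saved_readonly;   /* restore what was asked on the command line */
}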