Hello,

Thank you for the detailed review — I really appreciate the time and
care you’ve put into this. I’ve signed and sent the copyright
assignment.
> But did you trash to hard-reboot your system during disk load, replay,
> hard-reboot during load, replay, etc. ? :)

Yes, I’ve been doing just that :) — repeatedly hard-rebooting QEMU
while the system is under heavy write load (e.g., compiling Hurd,
touching random files in tight loops). So far, I haven’t been able to
catch the system in an inconsistent state.

The journal is intentionally very defensive during replay (a rough
sketch of the check loop is near the end of this message):

- It loads and validates the journal header first (magic, version,
  CRC32).
- It then loads all entries, validating each one’s magic, version,
  CRC, and sanity (e.g., presence of inode and action).
- If **any** check fails (meaning literally even a single element, a
  single check), the entire replay is aborted early.
- Only a fully consistent set of entries proceeds to sorting, graph
  building, and restoration.

That said, more testing is certainly warranted. It’s hard to time
things *just right* to catch an fsync mid-flight, but I’ll keep
experimenting.

> For proper atomicity we will probably need more than just calling a
> function at a given point: rearrangement of disk writes might be needed.
>
> We'd actually better make RPCs wait for the disk writes to be flushed
> and journal space be recovered. Otherwise we suddenly lose the safety
> provided by journaling.

Yes — currently the logging code only returns once both the individual
journal entry and the header are fully fsynced (this ordering is also
sketched below). If the journal entry isn’t on disk, the corresponding
filesystem change hasn’t happened either. But this logic runs in a
best-effort user-space context, without true block-level atomicity
guarantees. Coordinating this more tightly (perhaps by blocking RPC
returns) is something I’d like to explore further.

> Which kind of operations is spamming? As I mentioned, we most probably
> want to implement relatime, that'll be useful to avoid many writes
> anyway.

Mainly `utime` updates to `/dev/null` and `/dev/random`. These are
surprisingly frequent and pollute the journal. In my next patch, I’ve
implemented automatic exclusion for such inodes — they’re detected and
ignored dynamically (no need to configure them manually).

> Better use the ext3/4 native way of allocating blocks for the journal.

That’s exactly what I’d like to do next — but I’m not sure how to get
there in this context. Would this involve allocating blocks outside
the main filesystem namespace via libstore? Any pointers or examples
would be really appreciated.

> Does the normal path lookup not work? At worse by rearranging some code
> to provide an internal version not meant for RPCs.

That’s the trick: the issue isn’t how, but *when*. The journal
contains information from before the crash, but after reboot, we’re
walking a post-crash live filesystem. If we try to resolve inode paths
at boot, we might end up with mismatches, or restore paths that no
longer make sense.

One possible path forward is to:

1. Do a full inode/path scan at boot (ino → {ino, parent, name}).
2. Keep that mapping updated incrementally over time using the journal
   hooks.

This would let us resolve ino → path accurately on the live system
(before the crash/reboot), before we flush entries to the journal —
but it’s a much larger project that I’d prefer to tackle after the
core machinery is stable. (The shape of this map is sketched at the
end of this message.)

> You *don't* want to make .git unsafe.

Absolutely. That was more of a fun idea than a serious proposal :)

---

One additional note: while testing I discovered that the filesystem
remains read-only at that early point, and it only stops being
read-only after the RPCs come online.

If I just call `diskfs_node_update` that early (as I do in the patch),
it silently has no effect (!!!). If I try to force the mutation (e.g.,
setting `dn_stat_dirty = 1` and then calling `diskfs_node_update`), it
fails with "tried to write in readonly". Basically, it seems that in
early boot we can scan, look up, etc., but we cannot alter anything. :(

On the other hand, once RPCs are up, trying to walk the FS to replay
changes risks deadlocks (calling `diskfs_cached_lookup` on /tmp at any
point after early boot freezes the system).

It feels like journaling recovery needs to happen in a carefully
coordinated phase — perhaps a new pre-init mode, or deeper integration
with `diskfs` itself. Would love to hear your thoughts on how we might
move toward that.
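To make some of the above concrete, here are a few rough sketches.
All identifiers in them (magic values, struct layouts, helper
functions) are invented for this email rather than taken from the
patch, so please read them as illustrations of the logic, not the
code itself. First, the defensive replay validation:

```c
/* Sketch of the replay validation.  The invariant: one failed check
   anywhere aborts the whole replay, so we never act on a partially
   valid log.  All names here are invented for this email.  */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_MAGIC   0x4a524e4cu     /* arbitrary values for the sketch */
#define ENTRY_MAGIC     0x454e5452u
#define JOURNAL_VERSION 1u

struct journal_header { uint32_t magic, version, crc32, n_entries; };
struct journal_entry  { uint32_t magic, version, crc32, action; uint64_t ino; };

/* Hypothetical helpers: CRCs computed over everything except the
   crc32 field itself, and a bounds-checked accessor that returns NULL
   if entry I would fall outside the loaded buffer.  */
extern uint32_t header_crc32 (const struct journal_header *hdr);
extern uint32_t entry_crc32 (const struct journal_entry *e);
extern const struct journal_entry *entry_at (const void *buf, size_t len,
                                             uint32_t i);

static error_t
journal_validate (const void *buf, size_t len)
{
  const struct journal_header *hdr = buf;

  /* 1. Header first: magic, version, CRC32.  */
  if (len < sizeof *hdr
      || hdr->magic != JOURNAL_MAGIC
      || hdr->version != JOURNAL_VERSION
      || hdr->crc32 != header_crc32 (hdr))
    return EIO;                 /* unusable header: abort replay */

  /* 2. Every entry: magic, version, CRC, and sanity (inode and action
     must be present).  */
  for (uint32_t i = 0; i < hdr->n_entries; i++)
    {
      const struct journal_entry *e = entry_at (buf, len, i);

      if (e == NULL
          || e->magic != ENTRY_MAGIC
          || e->version != JOURNAL_VERSION
          || e->crc32 != entry_crc32 (e)
          || e->ino == 0
          || e->action == 0)
        return EIO;             /* a single bad element: abort replay */
    }

  /* 3. Only a fully consistent set proceeds to sorting, graph
     building, and restoration.  */
  return 0;
}
```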
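Second, the append path, schematically (reusing the types from the
previous sketch; `journal_entry_offset` is a hypothetical placement
helper, and I am assuming here that the header is what makes an entry
visible to replay). Nothing returns before both fsyncs have completed:

```c
/* Sketch of the append ordering: the entry is durable on disk before
   the header references it, so a crash between the two fsyncs leaves
   a header that simply does not know about the new entry, and replay
   treats it as never logged.  Names are again invented.  */
#include <errno.h>
#include <unistd.h>

extern off_t journal_entry_offset (const struct journal_header *hdr);

static error_t
journal_append (int fd, struct journal_header *hdr,
                const struct journal_entry *e)
{
  off_t off = journal_entry_offset (hdr);

  /* 1. Write the entry body and flush it to disk.  */
  if (pwrite (fd, e, sizeof *e, off) != sizeof *e)
    return errno;
  if (fsync (fd) != 0)
    return errno;

  /* 2. Only now publish it: bump the entry count, refresh the header
     CRC, write the header, and flush again.  */
  hdr->n_entries++;
  hdr->crc32 = header_crc32 (hdr);
  if (pwrite (fd, hdr, sizeof *hdr, 0) != sizeof *hdr)
    return errno;
  if (fsync (fd) != 0)
    return errno;

  return 0;
}
```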
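Finally, the shape I have in mind for the boot-time ino → path map
from the numbered list above (`ROOT_INO` and `path_map_find` are
hypothetical; the walk-to-root resolution is just one obvious way to
use such a map):

```c
/* Sketch of the boot-time map: built by a full scan at boot, then
   kept current from the journal hooks.  A full path is recovered by
   walking parents up to the root, so replay never has to walk the
   live (post-crash) tree.  */
#include <limits.h>
#include <string.h>
#include <sys/types.h>

#define ROOT_INO 2              /* conventionally the root directory */

struct path_map_entry
{
  ino_t ino;                    /* the inode itself */
  ino_t parent;                 /* inode of the containing directory */
  char name[NAME_MAX + 1];      /* entry name within that directory */
};

extern const struct path_map_entry *path_map_find (ino_t ino);

/* Build "/a/b/c" by prepending components from the leaf upward.  */
static int
path_map_resolve (ino_t ino, char *buf, size_t size)
{
  size_t pos;

  if (size == 0)
    return -1;
  pos = size - 1;
  buf[pos] = '\0';

  while (ino != ROOT_INO)
    {
      const struct path_map_entry *e = path_map_find (ino);
      size_t n;

      if (e == NULL)
        return -1;              /* unknown inode: skip this entry */
      n = strlen (e->name);
      if (pos < n + 1)
        return -1;              /* buffer too small */
      pos -= n;
      memcpy (buf + pos, e->name, n);
      buf[--pos] = '/';
      ino = e->parent;
    }

  memmove (buf, buf + pos, size - pos);
  return 0;
}
```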
Thanks again,
Milos

On Sun, Jul 20, 2025 at 5:51 AM Samuel Thibault <samuel.thiba...@gnu.org>
wrote:
> Hello,
>
> Thanks for the update on this.
>
> Milos Nikic, le ven. 18 juil. 2025 16:00:19 -0700, a écrit:
> > I'm now using this system routinely without issues,
>
> But did you trash to hard-reboot your system during disk load, replay,
> hard-reboot during load, replay, etc. ? :)
>
> > The journaling system is structured around a single public entry point
> > (journal.c, journal.h). All other components are internal to
> > libdiskfs. The patch adds around 9 new .c files and several .h
> > headers. Existing code changes are limited to a few hooks in
> > strategic places.
>
> For proper atomicity we will probably need more than just calling a
> function at a given point: rearrangement of disk writes might be needed.
>
> > It is designed to fail safely: if the system runs out of memory or
> > encounters a journal problem, it logs a message and continues cleanly
> > without affecting system operation.
>
> We'd actually better make RPCs wait for the disk writes to be flushed
> and journal space be recovered. Otherwise we suddenly lose the safety
> provided by journaling.
>
> > • Replay logic:
> >
> >   □ Metadata is only restored if the journaled update is newer than
> >     the current inode mtime, and the values differ.
>
> It would probably be useful to check what ext3/4/jbd do, rather than
> probably make the same mistakes they did in the past :)
>
> > • Noise filtering:
> >
> >   □ A hardcoded inode range excludes /dev/random, /dev/null, and
> >     other noisy devices that would otherwise spam the journal.
>
> Which kind of operations is spamming? As I mentioned, we most probably
> want to implement relatime, that'll be useful to avoid many writes
> anyway.
>
> > Future improvements
> >
> > 1. Auto-create journal file:
> >
> >   □ Detect or create RAW_DEVICE_PATH automatically if missing.
>
> Better use the ext3/4 native way of allocating blocks for the journal.
>
> > 2. Early-boot inode→path scanner:
> >
> >   □ Enables path resolution before RPCs are available.
> >
> >   □ Will allow smarter, full-path-based replay.
>
> Does the normal path lookup not work? At worse by rearranging some code
> to provide an internal version not meant for RPCs.
>
> > 3. Path-based ignore rules:
> >
> >   □ Skip metadata updates for files/directories matching patterns:
> >
> >     ☆ Paths like /.git/,
>
> You *don't* want to make .git unsafe.
>
> > /build/, etc.
>
> "build" risks quite a few false positives.
>
> > ☆ Extensions like .o, .a, ~, .swp
>
> These would be more commonly used indeed.
>
> Samuel