Thanks for the write-up. It is great.

Two minor details:
1) "Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
+flags"

uid, gid, and author should also be on that list. They are journaled
and restored as well.

And
2) +* Two write modes:
+
+ * Sync (default): blocking write; caller waits for journal flush.
+
+ * Async fallback: used only if writing fails (e.g., file temporarily
+   unavailable); entries are queued and flushed later.

Only sync mode exists now; async has been removed, so we probably
don't need to mention it.

Thanks once more!
Milos

On Tue, Aug 12, 2025 at 6:56 AM jbra...@dismail.de <jbra...@dismail.de>
wrote:

> * hurd/libdiskfs.mdwn: add a short summary paragraph.
> * hurd/libdiskfs/journal.mdwn: new file.
> ---
>  hurd/libdiskfs.mdwn         |  10 +-
>  hurd/libdiskfs/journal.mdwn | 238 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 247 insertions(+), 1 deletion(-)
>  create mode 100644 hurd/libdiskfs/journal.mdwn
>
> diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
> index dd499785..c939905b 100644
> --- a/hurd/libdiskfs.mdwn
> +++ b/hurd/libdiskfs.mdwn
> @@ -1,4 +1,4 @@
> -[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
> +[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation,
> Inc."]]
>
>  [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
>  id="license" text="Permission is granted to copy, distribute and/or
> modify this
> @@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts.
> A copy of the license
>  is included in the section entitled [[GNU Free Documentation
>  License|/fdl]]."]]"""]]
>
> +Hurd developers use `libdiskfs` to write filesystems like
> +[[translator/ext2fs]] and [[translator/fatfs]].  `libdiskfs` does
> +suffer from [[locking
> +issues|community/gsoc/project_ideas/libdiskfs_locking]].  In the
> +summer of 2025, Milos Nikic began adding a metadata
> +[[libdiskfs/journal]]. So far one can only use the journal for ext2fs.
> +It is not compatible with ext3 or ext4's journal.
> +
>
>  # Paging
>
> diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
> new file mode 100644
> index 00000000..f2bf70f5
> --- /dev/null
> +++ b/hurd/libdiskfs/journal.mdwn
> @@ -0,0 +1,238 @@
> +[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
> +
> +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> +id="license" text="Permission is granted to copy, distribute and/or
> modify this
> +document under the terms of the GNU Free Documentation License, Version
> 1.2 or
> +any later version published by the Free Software Foundation; with no
> Invariant
> +Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the
> license
> +is included in the section entitled [[GNU Free Documentation
> +License|/fdl]]."]]"""]]
> +
> +In the summer of 2025, Milos Nikic began working on a metadata
> +journaling subsystem for libdiskfs, which he started using with
> +ext2fs. His prototype journal stores metadata changes to raw disk
> +space outside of the ext2 filesystem but within the same partition.
> +On boot, before fsck runs, the journal is replayed to fix
> +inconsistencies. This journal should fix most issues that hard
> +shutdowns cause. Hopefully the ASCII art below is helpful.
> +
> +      |-------------+-------------+-------------|
> +      | partition 1 | partition 2 | partition 3 |
> +      |-------------+-------------+-------------|
> +      | begin ext2  | begin ext2  |             |
> +      | journal     | journal     |             |
> +      | config data | config data |             |
> +      |             |             |             |
> +      | /           | /home       |    swap     |
> +      |             |             |             |
> +      | end ext2    | end ext2    |             |
> +      |-------------+-------------|             |
> +      | journal in  | journal in  |             |
> +      | raw disk    | raw disk    |             |
> +      | space. 8MiB | space. 8MiB |             |
> +      |-------------+-------------+-------------|
> +
> +The journal is *not* a replacement for fsck, checksumming, ext4-style
> +transactions, or a strong consistency guarantee. It’s a *best-effort*,
> +*do-no-harm* crash-recovery helper that complements fsck by restoring
> +metadata and paths opportunistically.  This journal is not compatible
> +with ext3 or ext4's journal.
> +
> +The journaling subsystem writes metadata changes to a reserved raw
> +disk area outside the ext2-managed region.  The location and size are
> +discovered from `journal_hint` inside the ext2 superblock at boot.
> +Entries are written in a compact binary format with CRC32 protection,
> +stored in a circular buffer.  Early-boot replay reads the journal,
> +validates entries, and applies the most recent consistent metadata
> +state to the filesystem, including restoration of deleted or modified
> +files and directories.  The subsystem has been stress-tested (git
> +checkout, bulk deletions, crash/reboot loops) and successfully
> +preserves and replays metadata.
> +
> +Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
> +flags—i.e. metadata fields that can be restored without needing full
> +path knowledge.
> +
> +The journaling system is structured around a single public entry point
> +`libdiskfs/journal.c` and `libdiskfs/journal.h`. All other components
> +are internal to libdiskfs. Configuration data (offset, size, etc.) is
> +written in four reserved fields in the ext2 superblock.
> +
> +The journal captures all the major file system operations, yet not all
> +of them are used for replay for now.
> +
> +## Design details
> +
> +* Two write modes:
> +
> + * Sync (default): blocking write; caller waits for journal flush.
> +
> + * Async fallback: used only if writing fails (e.g., file temporarily
> +   unavailable); entries are queued and flushed later.
> +
> +* Journal file format:
> +
> + * Ring buffer
> +
> + * Magic/version checked
> +
> + * CRC32-protected header and entries
> +
> +* Boot-time replay:
> +
> + * During early boot, pread/write are unavailable. Instead, the replay
> +   code uses `_diskfs_rdwr_internal` to safely read the journal.
> +
> + * Memory use during replay is controlled via fixed-size arenas.
> +
> +* Replay logic:
> +
> + * Parsed entries are sorted and deduplicated via a graph.
> +
> + * Metadata is only restored if the journaled update is newer than the
> +   current inode `mtime`, and the values differ. It uses strong
> +   fingerprinting to prevent misapplying updates after inode reuse.
> +
> + * Replay is dual-path: inode-based first, falling back to path-based
> +   when needed.
> +
> + * “Best effort” file recreation under `/restore/[timestamp]` with
> +   correct metadata when files vanish after a crash.
> +
> +* Noise filtering:
> +
> + * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
> + noisy devices that would otherwise spam the journal.
> +
> + * The filter contains a dedicated policy module to filter out noisy
> +events (`/tmp`, build outputs, etc.).
> +
> +*Two tricky problems took significant work:*
> +
> +   1. *Path recovery:* `cred->po->path` often gives useful file paths, but
> +   sometimes needs sanitizing or is imprecise. Combined with the current
> +   name, it’s often enough to reconstruct missing files. Replay now uses
> +   path-based recovery when inode-based recovery fails.
> +
> +   2. *Aggressive inode reuse in ext2:* After deletion (say at fsck time,
> or
> +   any time really) the same inode number may be reassigned to a
> completely
> +   different file after reboot. Fingerprinting ensures we never apply
> stale
> +   updates to the wrong file.
> +
> +## Testing & results
> +
> +- Survived repeated hard reboots under concurrent create/delete stress.
> +
> +- In chaos tests where fsck over-deleted files, journaling replay brought
> +them back as expected.
> +
> +## *Future work ideas*
> +
> +- Better path preservation to improve replay accuracy.
> +
> +- Per-node timelines for smarter change grouping.
> +
> +- Integration with ext tooling to support formatting with journaling
> fields
> +and an 8 MiB carve-out.
> +
> +- Exporting replay stats via /proc-like interface.
> +
> + * Skip metadata updates for files/directories matching patterns:
> +
> +  * Paths like `/.git/`, `/build/`, etc.
> +
> +  * Extensions like .o, .a, ~, .swp
> +
> +  * Eventually user-configurable via static list or user-supplied config.
> +
> +## How to use this metadata journal
> +
> + To use the journal one must reserve an 8 MiB space outside the ext2
> + filesystem, but within its partition and write the journaling hints
> + into the ext2 superblock.
> +
> +This means the journal will live immediately after ext2 stops on disk.
> +
> +1. Shrink the ext2 filesystem by 8 MiB
> +
> +We’ll work directly on the image, so make a backup first.
> +First, find the ext2 partition start offset.
> +
> +               $ parted -sm debian-hurd.img unit B print
> +
> +Example output:
> +
> +       2:1000341504B:4194303999B:3193962496B:ext2::;
> +
> +The first number after 2: is the byte offset where the ext2 partition
> starts (1000341504 here).
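
In case it is useful for the how-to, here is a small sketch of pulling
the start offset out of that machine-readable line instead of reading
it by eye. The variable names are illustrative only, and it assumes the
`parted -sm ... unit B print` format shown above, with partition 2
being the ext2 one.

```shell
# Sketch only: parse the partition geometry from one line of
# `parted -sm IMG unit B print`. The sample line is the example output above.
line='2:1000341504B:4194303999B:3193962496B:ext2::;'

# Fields are colon-separated: number:start:end:size:fstype:name:flags;
# Each byte value carries a trailing "B" that has to be stripped.
start=$(printf '%s\n' "$line" | awk -F: '{ sub(/B$/, "", $2); print $2 }')
end=$(printf '%s\n' "$line" | awk -F: '{ sub(/B$/, "", $3); print $3 }')
size=$(printf '%s\n' "$line" | awk -F: '{ sub(/B$/, "", $4); print $4 }')

# Sanity check: the end offset is inclusive, so size = end - start + 1.
[ $((end - start + 1)) -eq "$size" ] && echo "start=$start"
```

The resulting value is what goes after `losetup -o` in the next step.
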
> +
> +- Attach the ext2 partition as a loop device
> +
> +               # losetup -o 1000341504 --show -f debian-hurd.img
> +
> +This prints something like `/dev/loop0` (use whatever it returns).
> +Check current block count (these are 4 KiB ext2 blocks)
> +
> +       # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +Example output:
> +
> +       Block count:              1035776
> +
> +Shrink by 8 MiB
> +
> +    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
> +
> +    New block count = 1035776 − 2048 = 1033728
> +
> +       # e2fsck -f /dev/loop0 (accept everything it asks)
> +       # resize2fs /dev/loop0 1033728
> +
> +Replace `1033728` with your calculated value.
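
The same arithmetic as a shell sketch, so nobody has to do it by hand.
It assumes 4 KiB ext2 blocks as in the example; the variable names are
made up for illustration, and the two input values should come from
your own `tune2fs -l` output.

```shell
# Sketch only: compute the block count to pass to resize2fs.
block_size=4096        # assumption: 4 KiB blocks ("Block size" in tune2fs -l)
old_blocks=1035776     # "Block count" from tune2fs -l (example value above)

journal_bytes=$((8 * 1024 * 1024))              # 8 MiB carve-out
journal_blocks=$((journal_bytes / block_size))  # blocks to give up
new_blocks=$((old_blocks - journal_blocks))     # resize2fs target

echo "journal_blocks=$journal_blocks new_blocks=$new_blocks"
```

With a different block size, only `block_size` changes; the 8 MiB
reservation stays the same.
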
> +Verify
> +
> +    # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +The number should be exactly 2048 less than the original.
> +Detach loop device
> +
> +       # losetup -d /dev/loop0
> +
> +2. Write the journaling hint to the superblock
> +
> +The ext2 superblock is 1024 bytes from the start of the partition.
> +The journaling hint is at offset 264 bytes from the start of the
> superblock.
> +
> +You can verify ext2 magic first (0x53ef) like so:
> +
> +       $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
> +
> +(needs to print "53 ef")
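
For readers wondering where `1000342528` comes from, a sketch of the
byte math using only the offsets stated above (1024 for the superblock,
0x38 for the magic, 264 for the hint). The variable names are
illustrative, and `part_start` is the example value from the parted
output.

```shell
# Sketch only: absolute offsets inside the image file.
part_start=1000341504                 # partition start in bytes (example)

sb_offset=$((part_start + 1024))      # superblock is 1024 bytes into the partition
magic_offset=$((sb_offset + 0x38))    # ext2 magic, 2 bytes, expected "53 ef"
hint_offset=$((sb_offset + 264))      # journaling hint, per the text above

echo "superblock=$sb_offset magic=$magic_offset hint=$hint_offset"
```

`magic_offset` is the value the `xxd -s` expression above evaluates to.
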
> +
> +Instead of doing all the byte math manually, use the attached script:
> +Show current hint
> +
> +       $ ./journal-hint.sh debian-hurd.img show
> +
> +enable journaling hint:
> +
> +       $ ./journal-hint.sh debian-hurd.img on
> +
> +(This assumes the journal lives in the last 8 MiB of partition 2 (safe
> after the shrink))
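
A sketch of what "the last 8 MiB of partition 2" works out to for the
example image. This only illustrates the arithmetic. How
`journal-hint.sh` encodes the location in the superblock (bytes or
blocks, relative or absolute) is not spelled out here, so treat the
variable names and the choice of units as assumptions.

```shell
# Sketch only: where the last 8 MiB of the example partition begins.
part_start=1000341504     # partition start in bytes (from parted)
part_size=3193962496      # partition size in bytes (from parted)

journal_bytes=$((8 * 1024 * 1024))
journal_rel=$((part_size - journal_bytes))   # offset from partition start
journal_abs=$((part_start + journal_rel))    # offset from image start

echo "relative=$journal_rel absolute=$journal_abs"
```
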
> +Disable journaling hint
> +
> +       $ ./journal-hint.sh debian-hurd.img off
> +
> +The script verifies ext2 magic before touching anything.
> +If the magic doesn’t match, it bails to prevent corruption.
> +
> +Safety first: Always work on a copy of your disk image. If the script
> +writes incorrect offsets, the low-level writer will overwrite whatever
> +is there, potentially corrupting your system! Make sure the journal
> +location is outside the filesystem by following the shrink procedure
> +above.
> +
> +Status:
> +
> +* `debian-hurd-20230608.img` — tested and works great.
> +* `debian-hurd-20250622.img` — tested and works great.
> --
> 2.50.1
>
>
