Hello,

jbra...@dismail.de, on Tue, 12 Aug 2025 14:13:06 -0400, wrote:
> * hurd/libdiskfs.mdwn: add a short summary paragraph.
> * hurd/libdiskfs/journal.mdwn: new file.

I'd say it's premature to document this journaling implementation, since
things can (and most probably will) change soon.

Samuel

> ---
>  hurd/libdiskfs.mdwn         |  10 +-
>  hurd/libdiskfs/journal.mdwn | 235 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 244 insertions(+), 1 deletion(-)
>  create mode 100644 hurd/libdiskfs/journal.mdwn
> 
> diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
> index dd499785..c939905b 100644
> --- a/hurd/libdiskfs.mdwn
> +++ b/hurd/libdiskfs.mdwn
> @@ -1,4 +1,4 @@
> -[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
> +[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation, Inc."]]
>  
>  [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
>  id="license" text="Permission is granted to copy, distribute and/or modify this
> @@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
>  is included in the section entitled [[GNU Free Documentation
>  License|/fdl]]."]]"""]]
>  
> +Hurd developers use `libdiskfs` to write filesystems like
> +[[translator/ext2fs]] and [[translator/fatfs]].  `libdiskfs` does
> +suffer from [[locking
> +issues|community/gsoc/project_ideas/libdiskfs_locking]].  In the
> +summer of 2025, Milos Nikic began adding a metadata
> +[[libdiskfs/journal]]. So far one can only use the journal for ext2fs.
> +It is not compatible with ext3 or ext4's journal.
> +
>  
>  # Paging
>  
> diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
> new file mode 100644
> index 00000000..8ea4506c
> --- /dev/null
> +++ b/hurd/libdiskfs/journal.mdwn
> @@ -0,0 +1,235 @@
> +[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
> +
> +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> +id="license" text="Permission is granted to copy, distribute and/or modify this
> +document under the terms of the GNU Free Documentation License, Version 1.2 or
> +any later version published by the Free Software Foundation; with no Invariant
> +Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
> +is included in the section entitled [[GNU Free Documentation
> +License|/fdl]]."]]"""]]
> +
> +In the summer of 2025, Milos Nikic began working on a metadata
> +journaling subsystem for libdiskfs, which he started using with
> +ext2fs. His prototype journal stores metadata changes to raw disk
> +space outside of the ext2 filesystem but within the same partition.
> +On boot, before fsck runs, the journal is replayed to fix
> +inconsistencies. This journal should fix most issues that hard
> +shutdowns cause. Hopefully the ASCII art below is helpful.
> +
> +      |-------------+-------------+-------------|
> +      | partition 1 | partition 2 | partition 3 |
> +      |-------------+-------------+-------------|
> +      | begin ext2  | begin ext2  |             |
> +      | journal     | journal     |             |
> +      | config data | config data |             |
> +      |             |             |             |
> +      | /           | /home       |    swap     |
> +      |             |             |             |
> +      | end ext2    | end ext2    |             |
> +      |-------------+-------------|             |
> +      | journal in  | journal in  |             |
> +      | raw disk    | raw disk    |             |
> +      | space. 8MiB | space. 8MiB |             |
> +      |-------------+-------------+-------------|
> +
> +The journal is *not* a replacement for fsck, checksumming, ext4-style
> +transactions, or a strong consistency guarantee. It’s a *best-effort*,
> +*do-no-harm* crash-recovery helper that complements fsck by restoring
> +metadata and paths opportunistically.  This journal is not compatible
> +with ext3 or ext4's journal.
> +
> +The journaling subsystem writes metadata changes to a reserved raw
> +disk area outside the ext2-managed region.  The location and size are
> +discovered from `journal_hint` inside the ext2 superblock at boot.
> +Entries are written in a compact binary format with CRC32 protection,
> +stored in a circular buffer.  Early-boot replay reads the journal,
> +validates entries, and applies the most recent consistent metadata
> +state to the filesystem, including restoration of deleted or modified
> +files and directories.  The subsystem has been stress-tested (git
> +checkout, bulk deletions, crash/reboot loops) and successfully
> +preserves and replays metadata.
> +
> +Currently, the journal restores: `mtime`, `ctime`, `st_mode`, flags,
> +uid, gid, and author—i.e. metadata fields that can be restored without
> +needing full path knowledge.
> +
> +The journaling system is structured around a single public entry point
> +`libdiskfs/journal.c` and `libdiskfs/journal.h`. All other components
> +are internal to libdiskfs. Configuration data (offset, size, etc.) is
> +written in four reserved fields in the ext2 superblock.
> +
> +The journal captures all the major file system operations, but replay
> +does not yet use all of them.
> +
> +## Design details
> +
> +* Two write modes:
> +
> + * Sync (default): blocking write; caller waits for journal flush.
> +
> +* Journal file format:
> +
> + * Ring buffer
> +
> + * Magic/version checked
> +
> + * CRC32-protected header and entries
> +
> +* Boot-time replay:
> +
> + * During early boot, pread/write are unavailable. Instead, the replay
> +   code uses `_diskfs_rdwr_internal` to safely read the journal.
> +
> + * Memory use during replay is controlled via fixed-size arenas.
> +
> +* Replay logic:
> +
> + * Parsed entries are sorted and deduplicated via a graph.
> +
> + * Metadata is only restored if the journaled update is newer than the
> +   current inode `mtime`, and the values differ. It uses strong
> +   fingerprinting to prevent misapplying updates after inode reuse.
> +
> + * Replay is dual-path: inode-based first, falling back to path-based
> +   when needed.
> +
> + * “Best effort” file recreation under `/restore/[timestamp]` with
> +   correct metadata when files vanish after a crash.
> +
> +* Noise filtering:
> +
> + * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
> + noisy devices that would otherwise spam the journal.
> +
> + * A dedicated policy module filters out noisy events (`/tmp`, build
> +   outputs, etc.).
> +
> +*Two tricky problems took significant work:*
> +
> +   1. *Path recovery:* `cred->po->path` often gives useful file paths, but
> +   sometimes needs sanitizing or is imprecise. Combined with the current
> +   name, it’s often enough to reconstruct missing files. Replay now uses
> +   path-based recovery when inode-based recovery fails.
> +
> +   2. *Aggressive inode reuse in ext2:* After deletion (say at fsck time, or
> +   any time really) the same inode number may be reassigned to a completely
> +   different file after reboot. Fingerprinting ensures we never apply stale
> +   updates to the wrong file.
> +
> +## Testing & results
> +
> +- Survived repeated hard reboots under concurrent create/delete stress.
> +
> +- In chaos tests where fsck over-deleted files, journaling replay brought
> +them back as expected.
> +
> +## *Future work ideas*
> +
> +- Better path preservation to improve replay accuracy.
> +
> +- Per-node timelines for smarter change grouping.
> +
> +- Integration with ext tooling to support formatting with journaling fields
> +and an 8 MiB carve-out.
> +
> +- Exporting replay stats via a /proc-like interface.
> +
> +- Skipping metadata updates for files/directories matching patterns:
> +
> +  * Paths like `/.git/`, `/build/`, etc.
> +
> +  * Extensions like `.o`, `.a`, `~`, `.swp`
> +
> +  * Eventually user-configurable via a static list or user-supplied config.
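
To make the pattern idea above concrete, here is a minimal sketch of such a filter as a shell function. It only illustrates the matching rules listed (paths and extensions); the real filter would live in C inside libdiskfs, and `is_noisy_path` is a made-up name, not something from the patch.

```shell
# Hypothetical helper: succeed (exit 0) when a path matches one of the
# "noisy" patterns from the list above, fail (exit 1) otherwise.
is_noisy_path() {
    case "$1" in
        */.git/*|*/build/*)    return 0 ;;  # version-control and build trees
        *.o|*.a|*'~'|*.swp)    return 0 ;;  # objects, archives, editor leftovers
        *)                     return 1 ;;
    esac
}

# Example: report which of a few paths would be skipped.
for p in /home/u/proj/.git/index /home/u/proj/main.c /home/u/proj/main.o; do
    if is_noisy_path "$p"; then
        echo "skip $p"
    else
        echo "keep $p"
    fi
done
```

A user-configurable version would presumably read the patterns from a list instead of hardcoding them in the `case` statement.
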
> +
> +## How to use this metadata journal
> +
> +To use the journal, one must reserve an 8 MiB space outside the ext2
> +filesystem but within its partition, and write the journaling hints
> +into the ext2 superblock.
> +
> +The journal then lives immediately after the end of the ext2 filesystem on disk.
> +
> +1. Shrink the ext2 filesystem by 8 MiB
> +
> +We’ll work directly on the image, so make a backup first.
> +First, find the ext2 partition start offset.
> +
> +             $ parted -sm debian-hurd.img unit B print
> +
> +Example output:
> +
> +     2:1000341504B:4194303999B:3193962496B:ext2::;
> +
> +The first number after `2:` is the byte offset where the ext2 partition starts (1000341504 here).
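
The offset can also be pulled out of the machine-readable output mechanically rather than by eye. A minimal sketch, assuming the colon-separated `parted -sm ... unit B print` format shown above; `partition_start_bytes` is a made-up helper name:

```shell
# Hypothetical helper: read `parted -sm IMAGE unit B print` output on stdin
# and print the start offset, in bytes, of partition number $1.
partition_start_bytes() {
    awk -F: -v part="$1" '$1 == part { sub(/B$/, "", $2); print $2 }'
}

# With a real image one would run:
#   parted -sm debian-hurd.img unit B print | partition_start_bytes 2
# Here the example line from above is fed in directly:
printf '%s\n' '2:1000341504B:4194303999B:3193962496B:ext2::;' \
    | partition_start_bytes 2
```

The result can then be passed to `losetup -o` in the next step.
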
> +
> +- Attach the ext2 partition as a loop device
> +
> +             # losetup -o 1000341504 --show -f debian-hurd.img
> +
> +This prints something like `/dev/loop0` (use whatever it returns).
> +Check current block count (these are 4 KiB ext2 blocks)
> +
> +     # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +Example output:
> +
> +     Block count:              1035776
> +
> +Shrink by 8 MiB
> +
> +    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
> +
> +    New block count = 1035776 − 2048 = 1033728
> +
> +     # e2fsck -f /dev/loop0 (accept everything it asks)
> +     # resize2fs /dev/loop0 1033728
> +
> +Replace `1033728` with your calculated value.
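
The block arithmetic above can be scripted so it also works for other block sizes. A small sketch; `new_block_count` is a made-up helper name, and the block size should be taken from the `Block size` line of `tune2fs -l` rather than assumed:

```shell
# Hypothetical helper: new_block_count CURRENT_BLOCKS BLOCK_SIZE_BYTES [JOURNAL_MIB]
# Prints the block count to pass to resize2fs so that JOURNAL_MIB MiB
# (8 by default) are freed at the end of the filesystem.
new_block_count() {
    blocks=$1
    block_size=$2
    journal_mib=${3:-8}
    journal_blocks=$(( journal_mib * 1024 * 1024 / block_size ))
    echo $(( blocks - journal_blocks ))
}

# Example with the numbers from this walkthrough (4 KiB blocks):
new_block_count 1035776 4096
```

With the values above this prints 1033728, matching the manual calculation.
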
> +Verify
> +
> +    # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +The number should be exactly 2048 less than the original.
> +Detach loop device
> +
> +     # losetup -d /dev/loop0
> +
> +2. Write the journaling hint to the superblock
> +
> +The ext2 superblock is 1024 bytes from the start of the partition.
> +The journaling hint is at offset 264 bytes from the start of the superblock.
> +
> +You can verify the ext2 magic first (0xEF53, stored little-endian on disk as the bytes `53 ef`) like so:
> +
> +     $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
> +
> +(needs to print "53 ef")
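
The same check can be wrapped in a function that computes the offset from the partition start instead of hardcoding 1000342528. This is only a sketch of the check described above, not the attached script; `has_ext2_magic` is a made-up name:

```shell
# Hypothetical helper: has_ext2_magic IMAGE PARTITION_START_BYTES
# Succeeds only if the two bytes at the ext2 magic position are "53 ef".
has_ext2_magic() {
    image=$1
    part_start=$2
    # The superblock starts 1024 bytes into the partition; the magic field
    # sits 0x38 (56) bytes into the superblock.
    offset=$(( part_start + 1024 + 56 ))
    bytes=$(dd if="$image" bs=1 skip="$offset" count=2 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    [ "$bytes" = "53ef" ]
}

# Usage with the partition offset found earlier:
#   has_ext2_magic debian-hurd.img 1000341504 && echo "ext2 magic OK"
```
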
> +
> +Instead of doing all the byte math manually, use the attached script:
> +Show current hint
> +
> +     $ ./journal-hint.sh debian-hurd.img show
> +
> +Enable journaling hint
> +
> +     $ ./journal-hint.sh debian-hurd.img on
> +
> +(This assumes the journal lives in the last 8 MiB of partition 2, which is safe after the shrink.)
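
To see which bytes of the image "the last 8 MiB of partition 2" refers to, the range can be computed from the partition start and size that parted reports. A sketch only: `journal_range` is a made-up helper, and how the attached script actually encodes the location in the superblock is not described here.

```shell
# Hypothetical helper: journal_range PART_START_BYTES PART_SIZE_BYTES [JOURNAL_MIB]
# Prints "START END": the absolute byte offsets (inclusive) within the image
# of the last JOURNAL_MIB MiB (8 by default) of the partition.
journal_range() {
    part_start=$1
    part_size=$2
    journal_mib=${3:-8}
    journal_bytes=$(( journal_mib * 1024 * 1024 ))
    start=$(( part_start + part_size - journal_bytes ))
    end=$(( part_start + part_size - 1 ))
    echo "$start $end"
}

# Example with the start and size from the parted output above:
journal_range 1000341504 3193962496
```

The end offset should match the partition end that parted prints (4194303999B in the example).
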
> +Disable journaling hint
> +
> +     $ ./journal-hint.sh debian-hurd.img off
> +
> +The script verifies ext2 magic before touching anything.
> +If the magic doesn’t match, it bails to prevent corruption.
> +
> +Safety first: Always work on a copy of your disk image. If the script
> +writes incorrect offsets, the low-level writer will overwrite whatever
> +is there, potentially corrupting your system! Make sure the journal
> +location is outside the filesystem by following the shrink procedure
> +above.
> +
> +Status:
> +
> +* `debian-hurd-20230608.img` — tested and works great.
> +* `debian-hurd-20250622.img` — tested and works great.
> -- 
> 2.50.1
> 
> 

-- 
Samuel
    if (argc > 1 && strcmp(argv[1], "-advice") == 0) {
        printf("Don't Panic!\n");
        exit(42);
    }
        -- Arnold Robbins in the LJ of February '95, describing RCS
