Hey Samuel,

While doing this I realized there is an assert in that patch that blows
up when no journal is present (rookie mistake, I'm sorry about that).

I can fix that real quick, or we can leave it for after you've seen it.
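
In case it helps triage, the shape of the fix is just to degrade gracefully
instead of asserting when no journal is configured. A minimal sketch with
illustrative names (journal_state, journal_flush), not the actual patch code:

```c
#include <stddef.h>

/* Hypothetical sketch: journal_state and journal_flush are illustrative
   names, not the actual patch symbols.  The idea is simply that a
   missing/inactive journal is a no-op, not an assertion failure.  */

struct journal_state
{
  int active;   /* nonzero once tune2fs -j has enabled the journal */
};

/* Returns 0 on success; silently does nothing when no journal exists.  */
static int
journal_flush (struct journal_state *j)
{
  if (j == NULL || !j->active)
    return 0;   /* no journal present: nothing to flush, not an error */

  /* ... the real commit/flush work would happen here ... */
  return 0;
}
```

The point is only the guard at the top; the current assert aborts the
translator on unjournaled partitions.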

Let me know.
Milos

On Thu, Mar 19, 2026, 10:04 AM Guy Fleury <[email protected]> wrote:

>
>
> On 19 March 2026 at 18:06:17 GMT+02:00, Milos Nikic <[email protected]>
> wrote:
> >Hello,
> >
> >Thanks for trying to build this.
> >
> >Yeah, I think hurd-0.9.git20251029 is a bit dated and doesn't contain
> >all the necessary code in libstore and the other libs that the patch
> >needs (the store_sync function, etc.), so it won't compile/work.
> >
> >For now it needs to be built from the master branch of the Hurd git
> >repository, if that is an option.
> >
> >I could try and create a separate giant patch that encompasses all the
> >changes since hurd-0.9.git20251029, but I'm not sure that is the right
> >way.
> No need for that extra work. I will try master.
> >
> >
> >Regards,
> >Milos
> >
> >On Thu, Mar 19, 2026, 7:07 AM gfleury <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm attempting to build Hurd with the JBD2 journaling patch on AMD64,
> but
> >> I'm encountering a compilation error that I need help resolving.
> >>
> >> Here's what I'm doing:
> >>
> >> 1. apt update
> >> 2. apt source hurd
> >> 3. cd hurd-0.9.git20251029
> >> 4. patch -p1 <
> >> ../v5-0001-ext2fs-Add-JBD2-journaling-to-ext2-libdiskfs.patch
> >> 5. sudo dpkg-buildpackage -us -uc -b
> >>
> >> The build fails with this error:
> >>
> >> ```
> >> ../../ext2fs/journal.c: In function 'flush_to_disk':
> >> ../../ext2fs/journal.c:592:17: error: implicit declaration of function
> >> 'store_sync' [-Wimplicit-function-declaration]
> >> 592 |   error_t err = store_sync (store);
> >>     |                 ^~~~~~~~~~
> >> ```
> >>
> >> Interestingly, the store_sync function doesn't seem to exist in
> >> hurd-0.9.git20251029, but it is present in the current master branch.
> Could
> >> this be a version compatibility issue with the patch?
> >>
> >> Any guidance would be appreciated. Thanks!
> >>
> >> On 2026-03-18 01:04, Milos Nikic wrote:
> >>
> >> Hello again Samuel,
> >>
> >> First of all, I want to apologize again for the patch churn over the
> >> past week. I wanted to put this to rest properly, and I am now sending
> >> my final, stable version.
> >>
> >> This is it. I have applied numerous fixes, performance tweaks, and
> >> cleanups. I am happy to report that this now performs on par with
> >> unjournaled ext2 on normal workloads, such as configuring/compiling
> >> the Hurd, installing and reinstalling packages via APT, and untarring
> >> large archives (like the Linux kernel). I have also heavily tested it
> >> against artificial stress conditions (which I am happy to share if
> >> there is interest), and it handles highly concurrent loads beautifully
> >> without deadlocks or memory leaks.
> >>
> >> Progressive checkpointing ensures the filesystem runs smoothly, and
> >> the feature remains strictly opt-in (until a partition is tuned with
> >> tune2fs -j, the journal is completely inactive).
> >>
> >> The new API in libdiskfs is minimal but expressive enough to wrap all
> >> filesystem operations in transactions and handle strict POSIX sync
> >> barriers.
> >>
> >> Since v4, I have made several major architectural improvements:
> >>
> >> Smart Auto-Commit: diskfs_journal_stop_transaction now automatically
> >> commits to disk if needs_sync has been flagged anywhere in the nested
> >> RPC chain and the reference count drops to zero.
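> >>
> >> To illustrate the rule, here is a minimal standalone sketch; txn_stop,
> >> txn_set_sync, and the struct fields are illustrative names, not the
> >> actual libdiskfs API:

```c
#include <stdbool.h>

/* Sketch of the described auto-commit rule: any nested RPC may flag
   needs_sync; the stop call of the *last* participant (refs hits zero)
   then commits instead of merely releasing its reference.  Names are
   illustrative, not the actual libdiskfs API.  */

struct txn
{
  int refs;          /* active participants in the nested RPC chain    */
  bool needs_sync;   /* some participant requires a POSIX sync barrier */
  bool committed;
};

static void
txn_commit (struct txn *t)
{
  t->committed = true;   /* stand-in for the real commit-to-disk path */
  t->needs_sync = false;
}

static void
txn_set_sync (struct txn *t)
{
  t->needs_sync = true;  /* may be called anywhere in the nested chain */
}

static void
txn_stop (struct txn *t)
{
  if (--t->refs == 0 && t->needs_sync)
    txn_commit (t);      /* smart auto-commit on the last release */
}
```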
> >>
> >> Cleaned ext2 internal Journal API: I have exposed journal_store_write
> >> and journal_store_read as block-device filter layers. Internal state
> >> checks (journal_has_active_transaction, etc.) are now strictly hidden.
> >> How the journal preserves the WAL property is now very obvious, as it
> >> directly intercepts physical store operations.
> >>
> >> The "Lifeboat" Cache: Those store wrappers now utilize a small,
> >> temporary internal cache to handle situations where the Mach VM pager
> >> rushes blocks due to memory pressure. The Lifeboat seamlessly
> >> intercepts and absorbs these hazard blocks without blocking the pager
> >> or emitting warnings, even at peak write throughput.
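> >>
> >> As a rough mental model (capacity, names, and layout here are mine,
> >> not the patch's), the Lifeboat behaves like a tiny overflow cache:

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative "lifeboat" sketch: a tiny cache that absorbs blocks the
   pager pushes out while a commit is in flight, so the pager is never
   blocked.  Capacity, names, and layout are mine, not the patch's.  */

enum { LB_SLOTS = 8, LB_BLOCK = 4096 };

struct lifeboat_entry
{
  bool used;
  unsigned block_id;
  char data[LB_BLOCK];
};

struct lifeboat
{
  struct lifeboat_entry e[LB_SLOTS];
};

/* Absorb a hazard block; returns false if the boat is full.  */
static bool
lifeboat_absorb (struct lifeboat *lb, unsigned block_id, const char *data)
{
  for (int i = 0; i < LB_SLOTS; i++)
    if (!lb->e[i].used)
      {
        lb->e[i].used = true;
        lb->e[i].block_id = block_id;
        memcpy (lb->e[i].data, data, LB_BLOCK);
        return true;
      }
  return false;
}

/* Look up an absorbed block so later reads see the latest content.  */
static const char *
lifeboat_find (const struct lifeboat *lb, unsigned block_id)
{
  for (int i = 0; i < LB_SLOTS; i++)
    if (lb->e[i].used && lb->e[i].block_id == block_id)
      return lb->e[i].data;
  return NULL;
}
```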
> >>
> >> As before, I have added detailed comments across the patch to explain
> >> the state machine and locking hierarchy. I know this is a complex
> >> subsystem, so I am more than happy to write additional documentation
> >> in whatever form is needed.
> >>
> >> Once again, apologies for the rapid iterations. I won't be touching
> >> this code further until I hear your feedback.
> >>
> >> Kind regards,
> >> Milos
> >>
> >> On Sun, Mar 15, 2026 at 9:01 PM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi Samuel,
> >>
> >> I am writing to sincerely apologize for the insane amount of patch churn
> >> over the last week. I know the rapid version bumps from v2 up to v4 have
> >> been incredibly noisy, and I want to hit the brakes before you spend any
> >> more time reviewing the current code.
> >>
> >> While running some extreme stress tests on a very small ext2 partition
> >> with the tiniest journal allowed by the tooling, I caught a few
> >> critical edge cases. While fixing those, I also realized that my
> >> libdiskfs VFS API boundary is clunkier than it needs to be. I am
> >> currently rewriting it to more closely match Linux's JBD2 semantics,
> >> where the VFS simply flags a transaction for sync and calls stop,
> >> allowing the journal to auto-commit when the reference count drops to
> >> zero.
> >>
> >> I'm also adding handling for cases where the Mach VM pager rushes blocks
> >> to the disk while they are in the process of committing. This safely
> >> intercepts them and will remove those warnings and WAL violations in
> almost
> >> all cases.
> >>
> >> Please completely disregard v4.
> >>
> >> I promise the churn is coming to an end. I am going to take a little
> >> time to finish this API contraction, stress-test it, polish it, and
> >> make sure it is 100% rock-solid. I will be back soon with a finalized
> >> v5.
> >>
> >> Thanks for your patience with my crazy iteration process!
> >>
> >> Best, Milos
> >>
> >>
> >>
> >> On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi Samuel,
> >>
> >> As promised, here is the thoroughly tested and benchmarked V4 revision
> of
> >> the JBD2 Journal for Hurd.
> >>
> >> This revision addresses a major performance bottleneck present in V3
> under
> >> heavy concurrent workloads. The new design restores performance to match
> >> vanilla Hurd's unjournaled ext2fs while preserving full crash
> consistency.
> >>
> >> Changes since V3:
> >> - Removed eager memcpy() from the journal_dirty_block() hot-path.
> >> - Introduced deferred block copying that triggers only when the
> >> transaction becomes quiescent.
> >> - Added a `needs_copy` flag to prevent redundant memory copies.
> >> - Eliminated the severe lock contention and memory bandwidth pressure
> >> observed in V3.
> >>
> >> Why the changes in v4 vs v3?
> >> I had previously identified that the last remaining performance
> >> bottleneck was a 4 KB memcpy every time journal_dirty_block is called,
> >> and I was thinking about how to improve it. A deferred copy comes to
> >> mind, but...
> >> The Hurd VFS locks at the node level rather than the physical block
> level
> >> (as Linux does). Because multiple nodes may share the same 4KB disk
> block,
> >> naively deferring the journal copy until commit time can capture torn
> >> writes if another thread is actively modifying a neighboring node in the
> >> same block.
> >>
> >> Precisely because of this V3 performed a 4KB memcpy immediately inside
> >> journal_dirty_block() (copy on write) while the node lock was held.
> While
> >> safe, this placed expensive memory operations and global journal lock
> >> contention directly in the VFS hot-path, causing severe slowdowns under
> >> heavy parallel workloads.
> >>
> >> V4 removes this eager copy entirely by leveraging an existing
> transaction
> >> invariant:
> >> All VFS threads increment and decrement the active transaction’s
> >> `t_updates` counter via the start/stop transaction functions. A
> transaction
> >> cannot commit until this counter reaches zero.
> >> When `t_updates == 0`, we are mathematically guaranteed that no VFS
> >> threads are mutating blocks belonging to the transaction. At that exact
> >> moment, the memory backing those blocks has fully settled and can be
> safely
> >> copied without risk of torn writes. A perfect place for a deferred
> write!
> >>
> >> journal_dirty_block() now simply records the dirty block id in a hash
> >> table, making the hot path strictly O(1) (this is why we see such a
> >> large performance boost between v3 and v4).
> >>
> >> But we also need to avoid redundant copies:
> >> Because transactions remain open for several seconds, `t_updates` may
> >> bounce to zero and back up many times during a heavy workload (as
> multiple
> >> VFS threads start/stop the transaction). To avoid repeatedly copying the
> >> same unchanged blocks every time the counter hits zero, each shadow
> buffer
> >> now contains a `needs_copy` flag.
> >>
> >> When a block is dirtied, the flag is set. When `t_updates` reaches zero,
> >> only buffers with `needs_copy == 1` are copied to the shadow buffers,
> after
> >> which the flag is cleared.
> >> So two things must be true for a block to be copied: 1) t_updates
> >> must have just hit 0, and 2) needs_copy must be 1.
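> >>
> >> A toy sketch of that two-condition rule (shadow_buf, txn_note_dirty,
> >> and txn_quiesce_copy are illustrative names, not the patch's symbols):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the deferred-copy rule described above: dirtying a block is
   an O(1) flag set; the 4 KB copy happens only once the transaction is
   quiescent (t_updates == 0), and only for flagged buffers.  Names are
   illustrative, not the actual patch symbols.  */

enum { BLOCK_SIZE = 4096 };

struct shadow_buf
{
  bool needs_copy;           /* set when the live block is dirtied */
  const char *live;          /* live block memory                  */
  char shadow[BLOCK_SIZE];   /* journal's stable copy              */
};

/* Hot path: O(1), no memcpy -- just flag the buffer.  */
static void
txn_note_dirty (struct shadow_buf *b)
{
  b->needs_copy = true;
}

/* Called when t_updates drops to zero: only then copy flagged blocks.  */
static void
txn_quiesce_copy (struct shadow_buf *bufs, int n, int t_updates)
{
  if (t_updates != 0)
    return;                  /* active writers remain: unsafe to copy */
  for (int i = 0; i < n; i++)
    if (bufs[i].needs_copy)
      {
        memcpy (bufs[i].shadow, bufs[i].live, BLOCK_SIZE);
        bufs[i].needs_copy = false;   /* avoid redundant re-copies */
      }
}
```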
> >>
> >> This architecture completely removes the hot-path bottleneck. Journaled
> >> ext2fs now achieves performance virtually identical to vanilla ext2fs,
> even
> >> under brutal concurrency (e.g. scripts doing heavy writes from multiple
> >> shells at the same time).
> >>
> >> I know this is a dense patch with a lot to unpack. I've documented the
> >> locking and Mach VM interactions as thoroughly as possible in the code
> >> itself (roughly 1/3 of the lines are comments in ext2fs/journal.c), but
> I
> >> understand there is only so much nuance that can fit into C comments.
> >> If it would be helpful, I would be happy to draft a dedicated document
> >> detailing the journal's lifecycle, its hooks into libdiskfs/ext2, and
> the
> >> rationale behind the macro-transaction design, so future developers
> have a
> >> clear reference.
> >>
> >> Looking forward to your thoughts.
> >>
> >> Best,
> >> Milos
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi Samuel,
> >>
> >> Just a quick heads-up: please hold off on reviewing this V3 series.
> >>
> >> While V3 works fast in simple, single-threaded scenarios (like
> >> configure or make ext2fs), I found while running some heavy
> >> multi-threaded stress tests on V3 that a significant performance
> >> degradation happens, due to a lock-contention bottleneck caused by the
> >> eager VFS memcpy hot-path (the memcpy inside journal_dirty_block,
> >> which is called thousands of times a second, really becomes a
> >> performance problem).
> >>
> >>  I have been working on a much cleaner approach that safely defers the
> >> block copying to the quiescent transaction stop state. It completely
> >> eliminates the VFS lock contention and brings the journaled performance
> >> back to vanilla ext2fs levels even with many threads competing at
> >> writing/reading/renaming in the same place.
> >>
> >> I am going to test this new architecture thoroughly over the next few
> days
> >> and will send it as V4 once I am certain it is rock solid.
> >>
> >> Thanks!
> >>
> >>
> >>
> >> On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hello Samuel and the Hurd team,
> >>
> >> I am sending over v3 of the journaling patch. I know v2 is still pending
> >> review, but while testing and profiling based on previous feedback, I
> >> realized the standard mapping wasn't scaling well for metadata-heavy
> >> workloads. I wanted to send this updated architecture your way to save
> you
> >> from spending time reviewing the obsolete v2 code.
> >>
> >> This version keeps the core JBD2 logic from v2 but introduces several
> >> structural optimizations, bug fixes, and code cleanups:
> >>     - Robin Hood Hash Map: Replaced ihash with a custom map for
> >> significantly tighter cache locality and faster lookups.
> >>     - O(1) Slab Allocator: Added a pre-allocated pool to make
> transaction
> >> buffers zero-allocation in the hot path.
> >>     - Unified Buffer Tracking: Eliminated the dual linked-list/map
> >> structure in favor of just the map, fixing a synchronization bug from v2
> >> and simplifying the code.
> >>     - A few other small bug fixes.
> >>     - Refactored Dirty Block Hooks: Moved the journal_dirty_block calls
> >> from inode.c directly into the ext2fs.h low-level block computation
> >> functions (record_global_poke, sync_global_ptr, record_indir_poke, and
> >> alloc_sync). This feels like a more natural fit and makes it much
> easier to
> >> ensure we aren't missing any call sites.
> >>
> >> Performance Benchmarks:
> >> I ran repeated tests on my machine to measure the overhead, comparing
> this
> >> v3 journal implementation against Vanilla Hurd.
> >> make ext2fs (CPU/Data bound - 5 runs):
> >>     Vanilla Hurd Average: ~2m 40.6s
> >>     Journal v3 Average: ~2m 41.3s
> >>     Result: Statistical tie. Journal overhead is practically zero.
> >>
> >> make clean && ../configure (Metadata bound - 5 runs):
> >>     Vanilla Hurd Average: ~3.90s (with latency spikes up to 4.29s)
> >>     Journal v3 Average: ~3.72s (rock-solid consistency, never breaking
> >> 3.9s)
> >>     Result: Journaled ext2 is actually faster and more predictable here
> >> due to the WAL absorbing random I/O.
> >>
> >> Crash Consistency Proof:
> >> Beyond performance, I wanted to demonstrate the actual crash recovery in
> >> action.
> >>     Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
> >>     Wait for the 5-second kjournald commit tick.
> >>     Hard crash the machine (kill -9 the QEMU process on the host).
> >>
> >> Inspecting from the Linux host before recovery shows the inode is
> >> completely busted (as expected):
> >>
> >> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
> >>
> >> debugfs 1.47.3 (8-Jul-2025)
> >> Inode: 373911   Type: bad type    Mode:  0000   Flags: 0x0
> >> Generation: 0    Version: 0x00000000
> >> User:     0   Group:     0   Size: 0
> >> File ACL: 0 Translator: 0
> >> Links: 0   Blockcount: 0
> >> Fragment:  Address: 0    Number: 0    Size: 0
> >> ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> BLOCKS:
> >>
> >> Note: On Vanilla Hurd, running fsck here would permanently lose the
> >> directory or potentially cause further damage depending on luck.
> >>
> >> Triggering the journal replay:
> >> sudo e2fsck -fy /dev/nbd0
> >>
> >> Inspecting immediately after recovery:
> >>
> >> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
> >>
> >> debugfs 1.47.3 (8-Jul-2025)
> >> Inode: 373911   Type: directory    Mode:  0775   Flags: 0x0
> >> Generation: 1773077012    Version: 0x00000000
> >> User:  1001   Group:  1001   Size: 4096
> >> File ACL: 0 Translator: 0
> >> Links: 2   Blockcount: 8
> >> Fragment:  Address: 0    Number: 0    Size: 0
> >> ctime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> atime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> mtime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> BLOCKS:
> >> (0):1507738
> >> TOTAL: 1
> >>
> >> The journal successfully reconstructed the directory, and logdump
> confirms
> >> the transactions were consumed perfectly.
> >>
> >> I have run similar hard-crash tests for rename, chmod, chown, etc.,
> >> with the same successful recovery results.
> >>
> >> I've attached the v3 diff. Let me know what you think of the new hash
> map
> >> and slab allocator approach!
> >>
> >> Best,
> >> Milos
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> And here is the last one...
> >>
> >> I hacked up an improvement for journal_dirty_block to try and see if I
> >> could speed it up a bit:
> >> 1) Used a specialized Robin Hood-based hash table for speed (no
> >> tombstones, etc.). I took it from one of my personal projects and just
> >> specialized it here a bit.
> >> 2) Used a small slab allocator to avoid malloc-ing in the hot path.
> >> 3) Liberally sprinkled __rdtsc() to get a sense of cycle time inside
> >> journal_dirty_block.
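> >>
> >> For the curious, a toy sketch of the Robin Hood scheme mentioned in 1)
> >> (fixed toy capacity, no resize or deletion; the hash, names, and sizes
> >> are mine, not the actual code):

```c
#include <stdint.h>
#include <string.h>

/* Minimal Robin Hood open-addressing sketch (insert + find): on a
   collision, the entry that has probed further "steals" the slot,
   keeping probe distances short and lookups cache-friendly.  Toy
   capacity with no resize; assumes the table never fills up.  */

enum { TABLE_SIZE = 64 };   /* power of two */

struct rh_slot { uint32_t key; uint32_t val; uint8_t used; uint8_t dist; };
struct rh_map  { struct rh_slot slots[TABLE_SIZE]; };

static uint32_t
rh_hash (uint32_t k)
{
  return (k * 2654435761u) & (TABLE_SIZE - 1);   /* Fibonacci hashing */
}

static void
rh_insert (struct rh_map *m, uint32_t key, uint32_t val)
{
  uint32_t i = rh_hash (key);
  uint8_t dist = 0;
  for (;;)
    {
      struct rh_slot *s = &m->slots[i];
      if (!s->used)
        {
          *s = (struct rh_slot) { key, val, 1, dist };
          return;
        }
      if (s->key == key)
        {
          s->val = val;                 /* update in place */
          return;
        }
      if (s->dist < dist)
        {                               /* Robin Hood swap: rich slot yields */
          struct rh_slot tmp = *s;
          *s = (struct rh_slot) { key, val, 1, dist };
          key = tmp.key;
          val = tmp.val;
          dist = tmp.dist;
        }
      i = (i + 1) & (TABLE_SIZE - 1);
      dist++;
    }
}

static int
rh_find (const struct rh_map *m, uint32_t key, uint32_t *val)
{
  uint32_t i = rh_hash (key);
  uint8_t dist = 0;
  for (;;)
    {
      const struct rh_slot *s = &m->slots[i];
      if (!s->used || s->dist < dist)
        return 0;                       /* would have been here: absent */
      if (s->key == key)
        {
          *val = s->val;
          return 1;
        }
      i = (i + 1) & (TABLE_SIZE - 1);
      dist++;
    }
}
```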
> >>
> >> Got to say, just this simple local change managed to shave off 3-5% of
> >> the slowdown.
> >>
> >> So my test is:
> >> - Boot Hurd
> >> - Inside Hurd go to the Hurd build directory
> >> - run:
> >> $ make clean && ../configure
> >> $ time make ext2fs
> >>
> >> I do it multiple times for 3 different versions of ext2 libraries
> >> 1) Vanilla Hurd (No Journal): ~avg, 151 seconds
> >>
> >> 2) Enhanced JBD2 (Slab + Custom Hash): ~159 seconds (5% slower!)
> >>
> >> 3) Baseline JBD2 (malloc + libihash what was sent in V2): ~168 seconds
> >>
> >> Of course there is a lot of variability, and my laptop is not a perfect
> >> environment for these kinds of benchmarks, but this is what I have.
> >>
> >> My printouts on the screen show this:
> >> ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
> >> ext2fs: part:5:device:wd0: warning: Total Dirty Calls:      339105
> >> ext2fs: part:5:device:wd0: warning: Total Function:         217101909
> >> cycles
> >> ext2fs: part:5:device:wd0: warning: Total Lock Wait:        16741691
> cycles
> >> ext2fs: part:5:device:wd0: warning: Total Alloc:            673363
> cycles
> >> ext2fs: part:5:device:wd0: warning: Total Memcpy:           137938008
> >> cycles
> >> ext2fs: part:5:device:wd0: warning: Total Hash Add:         258533
> cycles
> >> ext2fs: part:5:device:wd0: warning: Total Hash Find:        29501960
> cycles
> >> ext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call)
> ---
> >> ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Lock Wait:     49 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Memcpy:        406 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Malloc 1:      1 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Hash Add:      0 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Hash Find:      86 cycles
> >> ext2fs: part:5:device:wd0: warning: ==================
> >>
> >> The averages here say a lot: with these improvements we are now down to
> >> basically memcpy time, and for copying 4096 bytes of RAM I'm not sure we
> >> can make it take fewer than 400 cycles, so we are hitting hardware
> >> limitations.
> >> It would be great if we could avoid the memcpy here altogether, or
> >> delay it until commit or similar. I have some ideas, but they all
> >> require drastic changes across libdiskfs and ext2fs, and I'm not sure a
> >> few remaining percentage points of slowdown warrant that.
> >>
> >> Also, wow: during ext2 compilation this function (journal_dirty_block)
> >> is called a bit more than 1000 times per second (for each and every
> >> block ever touched by the compiler).
> >>
> >> I am attaching the altered journal.c with these changes, if anyone is
> >> interested in seeing the localized changes.
> >>
> >> Regards,
> >> Milos
> >>
> >> On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi Samuel,
> >>
> >> One quick detail I forgot to mention regarding the performance analysis:
> >>
> >> The entire ~0.4s performance impact I measured is isolated exclusively
> to
> >> journal_dirty_block.
> >>
> >> To verify this, I ran an experiment where I stubbed out
> >> journal_dirty_block so it just returned immediately (which obviously
> makes
> >> for a very fast, but not very useful, journal!). With that single
> function
> >> bypassed, the filesystem performs identically to vanilla Hurd.
> >>
> >> This confirms that the background kjournald flusher, the transaction
> >> reference counting, and the checkpointing logic add absolutely no
> >> noticeable latency to the VFS. The overhead is strictly tied to the
> physics
> >> of the memory copying and hashmap lookups in that one block which we can
> >> improve in subsequent patches.
> >>
> >> Thanks, Milos
> >>
> >>
> >> On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi Samuel,
> >>
> >> Thanks for reviewing my mental model on V1; I appreciate the detailed
> >> feedback.
> >>
> >> Attached is the v2 patch. Here is a breakdown of the architectural
> changes
> >> and refactors based on your review:
> >>
> >> 1. diskfs_node_update and the Pager
> >> Regarding the question, "Do we really want to update the node?": Yes, we
> >> must update it with every change. JBD2 works strictly at the physical
> block
> >> level, not the abstract node cache level. To capture a node change in
> the
> >> journal, the block content must be physically serialized to the
> transaction
> >> buffer. Currently, this path is diskfs_node_update ->
> diskfs_write_disknode
> >> -> journal_dirty_block.
> >> When wait is 0, this just copies the node details from the node-cache to
> >> the pager. It is strictly an in-memory serialization and is extremely
> fast.
> >> I have updated the documentation for diskfs_node_update to explicitly
> >> describe this behavior so future maintainers understand it isn't
> triggering
> >> synchronous disk I/O and doesn't measurably increase the latency of the
> >> file system.
> >> journal_dirty_block is now one of the most hammered functions in
> >> libdiskfs/ext2; more on that below.
> >>
> >> 2. Synchronous Wait & Factorization
> >> I completely agree with your factorization advice:
> >> write_disknode_journaled has been folded directly into
> >> diskfs_write_disknode, making it much cleaner.
> >> Regarding the wait flag: we are no longer ignoring it! Instead of
> blocking
> >> the VFS deeply in the stack, we now set an "IOU" flag on the
> transaction.
> >> This bubbles the sync requirement up to the outer RPC layer, which is
> the
> >> only place safe enough to actually sleep on the commit and thus maintain
> >> the POSIX sync requirement without deadlocking etc.
> >>
> >> 3. Multiple Writes to the Same Metadata Block
> >> "Can it happen that we write several times to the same metadata block?"
> >> Yes, multiple nodes can live in the same block. However, because the
> Mach
> >> pager always flushes the "latest snapshot" of the block, we don't have
> an
> >> issue with mixed or stale data hitting the disk.
> >> If RPCs hit while the pager is actively writing, that is all captured
> >> in the RUNNING TRANSACTION. If it happens that that RUNNING
> >> TRANSACTION has the same blocks the pager is committing, the RUNNING
> >> TRANSACTION will be forcibly committed.
> >>
> >> 4. The New libdiskfs API
> >> I added two new opaque accessors to diskfs.h:
> >>
> >>     diskfs_journal_set_sync
> >>     diskfs_journal_needs_sync
> >>
> >>     This allows inner nested functions to declare a strict need for a
> >> POSIX sync without causing lock inversions. We only commit at the top
> RPC
> >> layer once the operation is fully complete and locks are dropped.
> >>
> >> 5. Cleanups & Ordering
> >>     Removed the redundant record_global_poke calls.
> >>     Reordered the pager write notification in journal.c to sit after the
> >> committing function, as the pager write happens after the journal
> commit.
> >>     Merged the ext2_journal checks inside
> diskfs_journal_start_transaction
> >> to return early.
> >>     Reverted the bold unlock moves.
> >>     Fixed the information leaks.
> >>     Elevated the deadlock/WAL bypass logs to ext2_warning.
> >>
> >> Performance:
> >> I investigated the ~0.4s (increase from 4.9s to 5.3s) regression on my
> SSD
> >> during a heavy Hurd ../configure test. By stubbing out
> journal_dirty_block,
> >> performance returned to vanilla Hurd speeds, isolating the overhead to
> that
> >> specific function.
> >>
> >> A nanosecond profile reveals the cost is evenly split across the
> mandatory
> >> physics of a block journal:
> >>
> >>     25%: Lock Contention (Global transaction serialization)
> >>
> >>     22%: Memcpy (Shadowing the 4KB blocks)
> >>
> >>     21%: Hash Find (hurd_ihash lookups for block deduplication)
> >>
> >> I was surprised to see hurd_ihash taking up nearly a quarter of the
> >> overhead. I added some collision mitigation, but left further
> >> improvements out of this patch to keep the scope tight. In the future,
> >> we could drop the malloc entirely using a slab allocator and optimize
> >> the hashmap to get this overhead closer to zero (along with introducing
> >> a "frozen data" concept like Linux does, but that would be a bigger,
> >> non-localized change).
> >>
> >> Final Note on Lock Hierarchy
> >> The intended, deadlock-free use of the journal in libdiskfs is best
> >> illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h:
> >>   txn = diskfs_journal_start_transaction ();
> >>   pthread_mutex_lock (&np->lock);
> >>   (OPERATION);
> >>   diskfs_node_update (np, diskfs_synchronous);
> >>   pthread_mutex_unlock (&np->lock);
> >>   if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
> >>     diskfs_journal_commit_transaction (txn);
> >>   else
> >>     diskfs_journal_stop_transaction (txn);
> >>
> >> By keeping journal operations strictly outside of the node
> >> locking/unlocking phases, we treat it as the outermost "lock" on the
> file
> >> system, mathematically preventing deadlocks.
> >>
> >> Kind regards,
> >> Milos
> >>
> >>
> >>
> >> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <
> [email protected]>
> >> wrote:
> >>
> >>
> >> Hello,
> >>
> >> Milos Nikic, on Thu, 05 Mar 2026 09:31:26 -0800, wrote:
> >>
> >> Hurd VFS works in 3 layers:
> >>
> >>  1. Node cache layer: The abstract node lives here and it is the ground
> >> truth
> >>     of a running file system. When one does a stat myfile.txt, we get
> the
> >>     information straight from the cache. When we create a new file, it
> gets
> >>     placed in the cache, etc.
> >>
> >>  2. Pager layer: This is where nodes are serialized into the actual
> >> physical
> >>     representation (4KB blocks) that will later be written to disk.
> >>
> >>  3. Hard drive: The physical storage that receives the bytes from the
> >> pager.
> >>
> >> During normal operations (not a sync mount, fsync, etc.), the VFS
> operates
> >> almost entirely on Layer 1: The Node cache layer. This is why it's super
> >> fast.
> >> User changed atime? No problem. It just fetches a node from the node
> cache
> >> (hash table lookup, amortized to O(1)) and updates the struct in memory.
> >> And
> >> that is it.
> >>
> >>
> >> Yes, so that we get as efficient as possible.
> >>
> >> Only when the sync interval hits (every 30 seconds by default) does
> >> the node cache get iterated and serialized to the pager layer
> >> (diskfs_sync_everything -> write_all_disknodes -> write_node ->
> >> pager_sync). So basically, at that moment, we create a snapshot of the
> >> state of the node cache and place it onto the pager(s).
> >>
> >>
> >> It's not exactly a snapshot because the coherency between inodes and
> >> data is not completely enforced (we write all disknodes before asking
> >> the kernel to write back dirty pages, and then poke the writes).
> >>
> >> Even then, pager_sync is called with wait = 0. It is handed to the
> >> pager, which sends it to Mach. At some later time (seconds or so
> >> later), Mach sends it back to the ext2 pager, which finally issues
> >> store_write to write it to Layer 3 (the hard drive). And even that
> >> depends on how the driver reorders or delays it.
> >>
> >> The effect of this architecture is that when store_write is finally
> >> called, the absolute latest version of the node cache snapshot is what
> >> gets written to the storage. Is this basically correct?
> >>
> >>
> >> It seems to be so indeed.
> >>
> >> Are there any edge cases or mechanics that are wrong in this model
> >> that would make us receive a "stale" node cache snapshot?
> >>
> >>
> >> Well, it can be "stale" if another RPC hasn't called
> >> diskfs_node_update() yet, but that's what "safe" FSes are all about:
> >> they don't actually provide more than coherency of the content on the
> >> disk, so fsck is not supposed to be needed. Then, if a program really
> >> wants coherency between some files etc., it has to issue sync calls;
> >> dpkg does it for instance.
> >>
> >> Samuel
> >>
> >>
> >>
>
