Hey Samuel,

While doing this I realized there is an assert in that patch that blows up when no journal is present (rookie mistake, I'm sorry about that).
I can fix that real quick... or we can leave that for after you've seen it. Let me know.

Milos

On Thu, Mar 19, 2026, 10:04 AM Guy Fleury <[email protected]> wrote:

> On 19 March 2026 18:06:17 GMT+02:00, Milos Nikic <[email protected]> wrote:
> >Hello,
> >
> >Thanks for trying to build this.
> >
> >Yeah, I think hurd-0.9.git20251029 is a bit dated and doesn't contain all the necessary code in libstore and the other libraries that is needed (the store_sync function, etc.), so it won't compile/work.
> >
> >For now it needs to be built from the master branch of the Hurd git repository, if that is an option.
> >
> >I could try to create a separate giant patch that encompasses all the changes since hurd-0.9.git20251029, but I'm not sure that is the right way.
>
> No need for that extra work. I will try master.
>
> >Regards,
> >Milos
> >
> >On Thu, Mar 19, 2026, 7:07 AM gfleury <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm attempting to build Hurd with the JBD2 journaling patch on AMD64, but I'm encountering a compilation error that I need help resolving.
> >>
> >> Here's what I'm doing:
> >>
> >> 1. apt update
> >> 2. apt source hurd
> >> 3. cd hurd-0.9.git20251029
> >> 4. patch -p1 < ../v5-0001-ext2fs-Add-JBD2-journaling-to-ext2-libdiskfs.patch
> >> 5. sudo dpkg-buildpackage -us -uc -b
> >>
> >> The build fails with this error:
> >>
> >> ```
> >> ../../ext2fs/journal.c: In function 'flush_to_disk':
> >> ../../ext2fs/journal.c:592:17: error: implicit declaration of function 'store_sync' [-Wimplicit-function-declaration]
> >>   592 |   error_t err = store_sync (store);
> >>       |                 ^~~~~~~~~~
> >> ```
> >>
> >> Interestingly, the store_sync function doesn't seem to exist in hurd-0.9.git20251029, but it is present in the current master branch. Could this be a version compatibility issue with the patch?
> >>
> >> Any guidance would be appreciated. Thanks!
> >> On 2026-03-18 01:04, Milos Nikic wrote:
> >>
> >> Hello again Samuel,
> >>
> >> First of all, I want to apologize again for the patch churn over the past week. I wanted to put this to rest properly, and I am now sending my final, stable version.
> >>
> >> This is it. I have applied numerous fixes, performance tweaks, and cleanups. I am happy to report that this now performs on par with unjournaled ext2 on normal workloads, such as configuring/compiling the Hurd, installing and reinstalling packages via APT, and untarring large archives (like the Linux kernel). I have also heavily tested it against artificial stress conditions (which I am happy to share if there is interest), and it handles highly concurrent loads beautifully, without deadlocks or memory leaks.
> >>
> >> Progressive checkpointing ensures the filesystem runs smoothly, and the feature remains strictly opt-in (until a partition is tuned with tune2fs -j, the journal is completely inactive).
> >>
> >> The new API in libdiskfs is minimal but expressive enough to wrap all filesystem operations in transactions and handle strict POSIX sync barriers.
> >>
> >> Since v4, I have made several major architectural improvements:
> >>
> >> Smart Auto-Commit: diskfs_journal_stop_transaction now automatically commits to disk if needs_sync has been flagged anywhere in the nested RPC chain and the reference count drops to zero.
> >>
> >> Cleaned ext2 internal journal API: I have exposed journal_store_write and journal_store_read as block-device filter layers. Internal state checks (journal_has_active_transaction, etc.) are now strictly hidden. How the journal preserves the WAL property is now very obvious, as it directly intercepts physical store operations.
> >> The "Lifeboat" Cache: those store wrappers now utilize a small, temporary internal cache to handle situations where the Mach VM pager rushes blocks due to memory pressure. The Lifeboat seamlessly intercepts and absorbs these hazard blocks without blocking the pager or emitting warnings, even at peak write throughput.
> >>
> >> As before, I have added detailed comments across the patch to explain the state machine and locking hierarchy. I know this is a complex subsystem, so I am more than happy to write additional documentation in whatever form is needed.
> >>
> >> Once again, apologies for the rapid iterations. I won't be touching this code further until I hear your feedback.
> >>
> >> Kind regards,
> >> Milos
> >>
> >> On Sun, Mar 15, 2026 at 9:01 PM Milos Nikic <[email protected]> wrote:
> >>
> >> Hi Samuel,
> >>
> >> I am writing to sincerely apologize for the insane amount of patch churn over the last week. I know the rapid version bumps from v2 up to v4 have been incredibly noisy, and I want to hit the brakes before you spend any more time reviewing the current code.
> >>
> >> While running some extreme stress tests on a very small ext2 partition with the tiniest journal allowed by the tooling, I caught a few critical edge cases. While fixing those, I also realized that my libdiskfs VFS API boundary is clunkier than it needs to be. I am currently rewriting it to more closely match Linux's JBD2 semantics, where the VFS simply flags a transaction for sync and calls stop, allowing the journal to auto-commit when the reference count drops to zero.
> >>
> >> I'm also adding handling for cases where the Mach VM pager rushes blocks to the disk while they are in the process of committing. This safely intercepts them and will remove those warnings and WAL violations in almost all cases.
> >>
> >> Please completely disregard v4.
> >> I promise the churn is coming to an end. I am going to take a little time to finish this API contraction, stress-test it, polish it, and make sure it is 100% rock-solid. I will be back soon with a finalized v5.
> >>
> >> Thanks for your patience with my crazy iteration process!
> >>
> >> Best, Milos
> >>
> >> On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]> wrote:
> >>
> >> Hi Samuel,
> >>
> >> As promised, here is the thoroughly tested and benchmarked V4 revision of the JBD2 journal for the Hurd.
> >>
> >> This revision addresses a major performance bottleneck present in V3 under heavy concurrent workloads. The new design restores performance to match vanilla Hurd's unjournaled ext2fs while preserving full crash consistency.
> >>
> >> Changes since V3:
> >> - Removed the eager memcpy() from the journal_dirty_block() hot path.
> >> - Introduced deferred block copying that triggers only when the transaction becomes quiescent.
> >> - Added a `needs_copy` flag to prevent redundant memory copies.
> >> - Eliminated the severe lock contention and memory bandwidth pressure observed in V3.
> >>
> >> Why the changes in v4 vs. v3?
> >> I had previously identified that the last remaining performance bottleneck was the 4 KB memcpy performed every time journal_dirty_block is called, and I was thinking about how to improve it. A deferred copy comes to mind, but... the Hurd VFS locks at the node level rather than the physical block level (as Linux does). Because multiple nodes may share the same 4 KB disk block, naively deferring the journal copy until commit time can capture torn writes if another thread is actively modifying a neighboring node in the same block.
> >>
> >> Precisely because of this, V3 performed a 4 KB memcpy immediately inside journal_dirty_block() (copy on write) while the node lock was held. While safe, this placed expensive memory operations and global journal lock contention directly in the VFS hot path, causing severe slowdowns under heavy parallel workloads.
> >>
> >> V4 removes this eager copy entirely by leveraging an existing transaction invariant: all VFS threads increment and decrement the active transaction's `t_updates` counter via the start/stop transaction functions, and a transaction cannot commit until this counter reaches zero. When `t_updates == 0`, we are mathematically guaranteed that no VFS threads are mutating blocks belonging to the transaction. At that exact moment, the memory backing those blocks has fully settled and can be safely copied without risk of torn writes. A perfect place for a deferred write!
> >>
> >> journal_dirty_block() now simply records the dirty block id in a hash table, making the hot path strictly O(1) (and this is why we see an amazing performance boost between v3 and v4).
> >>
> >> But we also need to avoid redundant copies: because transactions remain open for several seconds, `t_updates` may bounce to zero and back up many times during a heavy workload (as multiple VFS threads start/stop the transaction). To avoid repeatedly copying the same unchanged blocks every time the counter hits zero, each shadow buffer now contains a `needs_copy` flag.
> >>
> >> When a block is dirtied, the flag is set. When `t_updates` reaches zero, only buffers with `needs_copy == 1` are copied to the shadow buffers, after which the flag is cleared. So two things need to be true for a block to be copied: 1) t_updates must have just hit 0, and 2) needs_copy must be 1.
> >>
> >> This architecture completely removes the hot-path bottleneck. Journaled ext2fs now achieves performance virtually identical to vanilla ext2fs, even under brutal concurrency (e.g. scripts doing heavy writes from multiple shells at the same time).
> >>
> >> I know this is a dense patch with a lot to unpack. I've documented the locking and Mach VM interactions as thoroughly as possible in the code itself (roughly 1/3 of the lines in ext2fs/journal.c are comments), but I understand there is only so much nuance that can fit into C comments. If it would be helpful, I would be happy to draft a dedicated document detailing the journal's lifecycle, its hooks into libdiskfs/ext2, and the rationale behind the macro-transaction design, so future developers have a clear reference.
> >>
> >> Looking forward to your thoughts.
> >>
> >> Best,
> >> Milos
> >>
> >> On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]> wrote:
> >>
> >> Hi Samuel,
> >>
> >> Just a quick heads-up: please hold off on reviewing this V3 series.
> >>
> >> While V3 works fast for simple, single-threaded scenarios (like configure, or make ext2fs, etc.), I found that under heavy multi-threaded stress tests a significant performance degradation happens, due to a lock contention bottleneck caused by the eager memcpy in the VFS hot path. (The memcpy inside journal_dirty_block, which is called thousands of times a second, really becomes a performance problem.)
> >>
> >> I have been working on a much cleaner approach that safely defers the block copying to the quiescent transaction-stop state. It completely eliminates the VFS lock contention and brings journaled performance back to vanilla ext2fs levels, even with many threads competing at writing/reading/renaming in the same place.
> >>
> >> I am going to test this new architecture thoroughly over the next few days and will send it as V4 once I am certain it is rock solid.
> >>
> >> Thanks!
> >> On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]> wrote:
> >>
> >> Hello Samuel and the Hurd team,
> >>
> >> I am sending over v3 of the journaling patch. I know v2 is still pending review, but while testing and profiling based on previous feedback, I realized the standard mapping wasn't scaling well for metadata-heavy workloads. I wanted to send this updated architecture your way to save you from spending time reviewing the obsolete v2 code.
> >>
> >> This version keeps the core JBD2 logic from v2 but introduces several structural optimizations, bug fixes, and code cleanups:
> >> - Robin Hood hash map: replaced ihash with a custom map for significantly tighter cache locality and faster lookups.
> >> - O(1) slab allocator: added a pre-allocated pool to make transaction buffers zero-allocation in the hot path.
> >> - Unified buffer tracking: eliminated the dual linked-list/map structure in favor of just the map, fixing a synchronization bug from v2 and simplifying the code.
> >> - A few other small bug fixes.
> >> - Refactored dirty-block hooks: moved the journal_dirty_block calls from inode.c directly into the ext2fs.h low-level block computation functions (record_global_poke, sync_global_ptr, record_indir_poke, and alloc_sync). This feels like a more natural fit and makes it much easier to ensure we aren't missing any call sites.
> >>
> >> Performance benchmarks:
> >> I ran repeated tests on my machine to measure the overhead, comparing this v3 journal implementation against vanilla Hurd.
> >>
> >> make ext2fs (CPU/data bound, 5 runs):
> >> Vanilla Hurd average: ~2m 40.6s
> >> Journal v3 average: ~2m 41.3s
> >> Result: statistical tie. Journal overhead is practically zero.
> >> make clean && ../configure (metadata bound, 5 runs):
> >> Vanilla Hurd average: ~3.90s (with latency spikes up to 4.29s)
> >> Journal v3 average: ~3.72s (rock-solid consistency, never breaking 3.9s)
> >> Result: journaled ext2 is actually faster and more predictable here, due to the WAL absorbing random I/O.
> >>
> >> Crash consistency proof:
> >> Beyond performance, I wanted to demonstrate the actual crash recovery in action.
> >> 1. Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
> >> 2. Wait for the 5-second kjournald commit tick.
> >> 3. Hard-crash the machine (kill -9 the QEMU process on the host).
> >>
> >> Inspecting from the Linux host before recovery shows the inode is completely busted (as expected):
> >>
> >> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
> >>
> >> debugfs 1.47.3 (8-Jul-2025)
> >> Inode: 373911   Type: bad type   Mode: 0000   Flags: 0x0
> >> Generation: 0   Version: 0x00000000
> >> User: 0   Group: 0   Size: 0
> >> File ACL: 0   Translator: 0
> >> Links: 0   Blockcount: 0
> >> Fragment:  Address: 0   Number: 0   Size: 0
> >> ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> >> BLOCKS:
> >>
> >> Note: on vanilla Hurd, running fsck here would permanently lose the directory, or potentially cause further damage depending on luck.
> >> Triggering the journal replay:
> >> sudo e2fsck -fy /dev/nbd0
> >>
> >> Inspecting immediately after recovery:
> >>
> >> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
> >>
> >> debugfs 1.47.3 (8-Jul-2025)
> >> Inode: 373911   Type: directory   Mode: 0775   Flags: 0x0
> >> Generation: 1773077012   Version: 0x00000000
> >> User: 1001   Group: 1001   Size: 4096
> >> File ACL: 0   Translator: 0
> >> Links: 2   Blockcount: 8
> >> Fragment:  Address: 0   Number: 0   Size: 0
> >> ctime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> atime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> mtime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
> >> BLOCKS:
> >> (0):1507738
> >> TOTAL: 1
> >>
> >> The journal successfully reconstructed the directory, and logdump confirms the transactions were consumed perfectly.
> >>
> >> I have run similar hard-crash tests for rename, chmod, chown, etc., with the same successful recovery results.
> >>
> >> I've attached the v3 diff. Let me know what you think of the new hash map and slab allocator approach!
> >>
> >> Best,
> >> Milos
> >>
> >> On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]> wrote:
> >>
> >> And here is the last one...
> >>
> >> I hacked up an improvement for journal_dirty_block to try to see if I could speed it up a bit:
> >> 1) Used a specialized Robin Hood hash table for speed (no tombstones, etc.). (I took it from one of my personal projects and just specialized it here a bit.)
> >> 2) Used a small slab allocator to avoid malloc-ing in the hot path.
> >> 3) Liberally sprinkled __rdtsc() to get a sense of cycle time inside journal_dirty_block.
> >>
> >> Got to say, just this simple local change managed to shave off 3-5% of the slowness.
> >> So my test is:
> >> - Boot Hurd.
> >> - Inside Hurd, go to the Hurd build directory.
> >> - Run:
> >> $ make clean && ../configure
> >> $ time make ext2fs
> >>
> >> I do it multiple times for 3 different versions of the ext2 libraries:
> >> 1) Vanilla Hurd (no journal): ~151 seconds average
> >> 2) Enhanced JBD2 (slab + custom hash): ~159 seconds (about 5% slower than vanilla)
> >> 3) Baseline JBD2 (malloc + libihash, what was sent in V2): ~168 seconds
> >>
> >> Of course there is a lot of variability, and my laptop is not a perfect environment for these kinds of benchmarks, but this is what I have.
> >>
> >> My printouts on the screen show this:
> >> ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
> >> ext2fs: part:5:device:wd0: warning: Total Dirty Calls: 339105
> >> ext2fs: part:5:device:wd0: warning: Total Function:  217101909 cycles
> >> ext2fs: part:5:device:wd0: warning: Total Lock Wait:  16741691 cycles
> >> ext2fs: part:5:device:wd0: warning: Total Alloc:        673363 cycles
> >> ext2fs: part:5:device:wd0: warning: Total Memcpy:    137938008 cycles
> >> ext2fs: part:5:device:wd0: warning: Total Hash Add:     258533 cycles
> >> ext2fs: part:5:device:wd0: warning: Total Hash Find:  29501960 cycles
> >> ext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call) ---
> >> ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Lock Wait: 49 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Memcpy: 406 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Malloc: 1 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Hash Add: 0 cycles
> >> ext2fs: part:5:device:wd0: warning: Avg Hash Find: 86 cycles
> >> ext2fs: part:5:device:wd0: warning: ==================
> >>
> >> The averages here say a lot: with these improvements we are now down to basically memcpy time, and for copying 4096 bytes of RAM I'm not sure we can make it take less than ~400 cycles, so we are hitting hardware limitations. It would be great if we could avoid the memcpy here altogether, or delay it until commit or similar, and I have some ideas, but they all require drastic changes across libdiskfs and ext2fs; I'm not sure a few remaining percentage points of slowdown warrant that.
> >>
> >> Also, wow: during the ext2 compilation this function (journal_dirty_block) is being called a bit more than 1000 times per second (for each and every block that is ever touched by the compiler).
> >>
> >> I am attaching the altered journal.c with these changes, in case anyone is interested in seeing the localized changes.
> >>
> >> Regards,
> >> Milos
> >>
> >> On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]> wrote:
> >>
> >> Hi Samuel,
> >>
> >> One quick detail I forgot to mention regarding the performance analysis:
> >>
> >> The entire ~0.4s performance impact I measured is isolated exclusively to journal_dirty_block.
> >>
> >> To verify this, I ran an experiment where I stubbed out journal_dirty_block so it just returned immediately (which obviously makes for a very fast, but not very useful, journal!). With that single function bypassed, the filesystem performs identically to vanilla Hurd.
> >>
> >> This confirms that the background kjournald flusher, the transaction reference counting, and the checkpointing logic add absolutely no noticeable latency to the VFS. The overhead is strictly tied to the physics of the memory copying and hashmap lookups in that one function, which we can improve in subsequent patches.
> >>
> >> Thanks, Milos
> >>
> >> On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]> wrote:
> >>
> >> Hi Samuel,
> >>
> >> Thanks for reviewing my mental model on V1; I appreciate the detailed feedback.
> >>
> >> Attached is the v2 patch. Here is a breakdown of the architectural changes and refactors based on your review:
> >>
> >> 1. diskfs_node_update and the pager
> >> Regarding the question, "Do we really want to update the node?": yes, we must update it with every change. JBD2 works strictly at the physical block level, not the abstract node-cache level. To capture a node change in the journal, the block content must be physically serialized to the transaction buffer. Currently, this path is diskfs_node_update -> diskfs_write_disknode -> journal_dirty_block.
> >> When wait is 0, this just copies the node details from the node cache to the pager. It is strictly an in-memory serialization and is extremely fast. I have updated the documentation for diskfs_node_update to explicitly describe this behavior, so future maintainers understand it isn't triggering synchronous disk I/O and doesn't measurably increase the latency of the file system.
> >> journal_dirty_block is now one of the most hammered functions in libdiskfs/ext2; more on that below.
> >>
> >> 2. Synchronous wait & factorization
> >> I completely agree with your factorization advice: write_disknode_journaled has been folded directly into diskfs_write_disknode, making it much cleaner.
> >> Regarding the wait flag: we are no longer ignoring it! Instead of blocking the VFS deep in the stack, we now set an "IOU" flag on the transaction. This bubbles the sync requirement up to the outer RPC layer, which is the only place safe enough to actually sleep on the commit, and thus maintains the POSIX sync requirement without deadlocking, etc.
> >>
> >> 3. Multiple writes to the same metadata block
> >> "Can it happen that we write several times to the same metadata block?" Yes, multiple nodes can live in the same block. However, because the Mach pager always flushes the "latest snapshot" of the block, we don't have an issue with mixed or stale data hitting the disk.
> >> If RPCs hit while the pager is actively writing, that is all captured in the running transaction. If it happens that the running transaction contains the same blocks the pager is committing, the running transaction will be forcibly committed.
> >>
> >> 4. The new libdiskfs API
> >> I added two new opaque accessors to diskfs.h:
> >>
> >> diskfs_journal_set_sync
> >> diskfs_journal_needs_sync
> >>
> >> This allows inner nested functions to declare a strict need for a POSIX sync without causing lock inversions. We only commit at the top RPC layer, once the operation is fully complete and locks are dropped.
> >>
> >> 5. Cleanups & ordering
> >> - Removed the redundant record_global_poke calls.
> >> - Reordered the pager write notification in journal.c to sit after the committing function, as the pager write happens after the journal commit.
> >> - Merged the ext2_journal checks inside diskfs_journal_start_transaction to return early.
> >> - Reverted the bold unlock moves.
> >> - Fixed the information leaks.
> >> - Elevated the deadlock/WAL-bypass logs to ext2_warning.
> >>
> >> Performance:
> >> I investigated the ~0.4s regression (an increase from 4.9s to 5.3s) on my SSD during a heavy Hurd ../configure test. By stubbing out journal_dirty_block, performance returned to vanilla Hurd speeds, isolating the overhead to that specific function.
> >>
> >> A nanosecond profile reveals the cost is evenly split across the mandatory physics of a block journal:
> >> 25%: lock contention (global transaction serialization)
> >> 22%: memcpy (shadowing the 4 KB blocks)
> >> 21%: hash find (hurd_ihash lookups for block deduplication)
> >>
> >> I was surprised to see hurd_ihash taking up nearly a quarter of the overhead. I added some collision mitigation, but left further improvements out of this patch to keep the scope tight. In the future, we could drop the malloc entirely using a slab allocator and optimize the hashmap to get this overhead closer to zero (along with introducing a "frozen data" concept like Linux does, but that would be a bigger, non-localized change).
> >>
> >> Final note on lock hierarchy
> >> The intended, deadlock-free use of the journal in libdiskfs is best illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h:
> >>
> >> txn = diskfs_journal_start_transaction ();
> >> pthread_mutex_lock (&np->lock);
> >> (OPERATION);
> >> diskfs_node_update (np, diskfs_synchronous);
> >> pthread_mutex_unlock (&np->lock);
> >> if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
> >>   diskfs_journal_commit_transaction (txn);
> >> else
> >>   diskfs_journal_stop_transaction (txn);
> >>
> >> By keeping journal operations strictly outside of the node locking/unlocking phases, we treat the journal as the outermost "lock" on the file system, mathematically preventing deadlocks.
> >>
> >> Kind regards,
> >> Milos
> >>
> >> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> Milos Nikic, on Thu, 05 Mar 2026 09:31:26 -0800, wrote:
> >>
> >> Hurd VFS works in 3 layers:
> >>
> >> 1. Node cache layer: the abstract node lives here, and it is the ground truth of a running file system. When one does a stat myfile.txt, we get the information straight from the cache. When we create a new file, it gets placed in the cache, etc.
> >>
> >> 2. Pager layer: this is where nodes are serialized into the actual physical representation (4 KB blocks) that will later be written to disk.
> >>
> >> 3. Hard drive: the physical storage that receives the bytes from the pager.
> >>
> >> During normal operations (not a sync mount, fsync, etc.), the VFS operates almost entirely on layer 1, the node cache layer. This is why it's super fast.
> >> User changed atime? No problem. It just fetches a node from the node cache (hash table lookup, amortized to O(1)) and updates the struct in memory. And that is it.
> >>
> >> Yes, so that we get as efficient as possible.
> >>
> >> Only when the sync interval hits (every 30 seconds by default) does the node cache get iterated and serialized to the pager layer (diskfs_sync_everything -> write_all_disknodes -> write_node -> pager_sync). So basically, at that moment, we create a snapshot of the state of the node cache and place it onto the pager(s).
> >>
> >> It's not exactly a snapshot, because the coherency between inodes and data is not completely enforced (we write all disknodes before asking the kernel to write back dirty pages, and then poke the writes).
> >>
> >> Even then, pager_sync is called with wait = 0. It is handed to the pager, which sends it to Mach. At some later time (seconds or so later), Mach sends it back to the ext2 pager, which finally issues store_write to write it to layer 3 (the hard drive). And even that depends on how the driver reorders or delays it.
> >>
> >> The effect of this architecture is that when store_write is finally called, the absolute latest version of the node cache snapshot is what gets written to the storage. Is this basically correct?
> >>
> >> It seems to be so indeed.
> >>
> >> Are there any edge cases or mechanics that are wrong in this model that would make us receive a "stale" node cache snapshot?
> >>
> >> Well, it can be "stale" if another RPC hasn't called diskfs_node_update() yet, but that's what "safe" FS are all about: not actually providing more than coherency of the content on the disk, so fsck is not supposed to be needed. Then, if a program really wants coherency between some files etc., it has to issue sync calls; dpkg does it, for instance.
> >>
> >> Samuel
