Hello,

Thanks for trying to build this.
Yeah, I think hurd-0.9.git20251029 is a bit dated and doesn't contain all the code in libstore and the other libraries that the patch needs (the store_sync function, etc.), so it won't compile or work. For now it needs to be built from the master branch of the Hurd git repository, if that is an option. I could try to create a separate giant patch that encompasses all the changes since hurd-0.9.git20251029, but I'm not sure that is the right way.

Regards,
Milos

On Thu, Mar 19, 2026, 7:07 AM gfleury <[email protected]> wrote:

> Hi,
>
> I'm attempting to build Hurd with the JBD2 journaling patch on AMD64, but I'm
> encountering a compilation error that I need help resolving.
>
> Here's what I'm doing:
>
> 1. apt update
> 2. apt source hurd
> 3. cd hurd-0.9.git20251029
> 4. patch -p1 < ../v5-0001-ext2fs-Add-JBD2-journaling-to-ext2-libdiskfs.patch
> 5. sudo dpkg-buildpackage -us -uc -b
>
> The build fails with this error:
>
> ```
> ../../ext2fs/journal.c: In function 'flush_to_disk':
> ../../ext2fs/journal.c:592:17: error: implicit declaration of function 'store_sync' [-Wimplicit-function-declaration]
>   592 |   error_t err = store_sync (store);
>       |                 ^~~~~~~~~~
> ```
>
> Interestingly, the store_sync function doesn't seem to exist in
> hurd-0.9.git20251029, but it is present in the current master branch. Could
> this be a version compatibility issue with the patch?
>
> Any guidance would be appreciated. Thanks!
>
> On 2026-03-18 01:04, Milos Nikic wrote:
>
> Hello again Samuel,
>
> First of all, I want to apologize again for the patch churn over the past
> week. I wanted to put this to rest properly, and I am now sending my final,
> stable version.
>
> This is it. I have applied numerous fixes, performance tweaks, and cleanups.
> I am happy to report that this now performs on par with unjournaled ext2 on
> normal workloads, such as configuring/compiling the Hurd, installing and
> reinstalling packages via APT, and untarring large archives (like the Linux
> kernel).
> I have also heavily tested it against artificial stress conditions (which I
> am happy to share if there is interest), and it handles highly concurrent
> loads beautifully, without deadlocks or memory leaks.
>
> Progressive checkpointing ensures the filesystem runs smoothly, and the
> feature remains strictly opt-in (until a partition is tuned with tune2fs -j,
> the journal is completely inactive).
>
> The new API in libdiskfs is minimal but expressive enough to wrap all
> filesystem operations in transactions and handle strict POSIX sync barriers.
>
> Since v4, I have made several major architectural improvements:
>
> Smart auto-commit: diskfs_journal_stop_transaction now automatically commits
> to disk if needs_sync has been flagged anywhere in the nested RPC chain and
> the reference count drops to zero.
>
> Cleaned-up internal ext2 journal API: I have exposed journal_store_write and
> journal_store_read as block-device filter layers. Internal state checks
> (journal_has_active_transaction, etc.) are now strictly hidden. How the
> journal preserves the WAL property is now very obvious, as it directly
> intercepts physical store operations.
>
> The "Lifeboat" cache: those store wrappers now use a small, temporary
> internal cache to handle situations where the Mach VM pager rushes blocks
> out due to memory pressure. The Lifeboat seamlessly intercepts and absorbs
> these hazard blocks without blocking the pager or emitting warnings, even at
> peak write throughput.
>
> As before, I have added detailed comments across the patch to explain the
> state machine and locking hierarchy. I know this is a complex subsystem, so
> I am more than happy to write additional documentation in whatever form is
> needed.
>
> Once again, apologies for the rapid iterations. I won't be touching this
> code further until I hear your feedback.
> Kind regards,
> Milos
>
> On Sun, Mar 15, 2026 at 9:01 PM Milos Nikic <[email protected]> wrote:
>
> Hi Samuel,
>
> I am writing to sincerely apologize for the insane amount of patch churn
> over the last week. I know the rapid version bumps from v2 up to v4 have
> been incredibly noisy, and I want to hit the brakes before you spend any
> more time reviewing the current code.
>
> While running some extreme stress tests on a very small ext2 partition with
> the tiniest journal allowed by the tooling, I caught a few critical edge
> cases. While fixing those, I also realized that my libdiskfs VFS API
> boundary is clunkier than it needs to be. I am currently rewriting it to
> more closely match Linux's JBD2 semantics, where the VFS simply flags a
> transaction for sync and calls stop, allowing the journal to auto-commit
> when the reference count drops to zero.
>
> I'm also adding handling for cases where the Mach VM pager rushes blocks to
> the disk while they are in the process of committing. This safely
> intercepts them and will remove those warnings and WAL violations in almost
> all cases.
>
> Please completely disregard v4.
>
> I promise the churn is coming to an end. I am going to take a little time
> to finish this API contraction, stress-test it, polish it, and make sure it
> is 100% rock solid. I will be back soon with a finalized v5.
>
> Thanks for your patience with my crazy iteration process!
>
> Best, Milos
>
> On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]> wrote:
>
> Hi Samuel,
>
> As promised, here is the thoroughly tested and benchmarked V4 revision of
> the JBD2 journal for the Hurd.
>
> This revision addresses a major performance bottleneck present in V3 under
> heavy concurrent workloads. The new design restores performance to match
> vanilla Hurd's unjournaled ext2fs while preserving full crash consistency.
>
> Changes since V3:
> - Removed the eager memcpy() from the journal_dirty_block() hot-path.
> - Introduced deferred block copying that triggers only when the transaction
>   becomes quiescent.
> - Added a `needs_copy` flag to prevent redundant memory copies.
> - Eliminated the severe lock contention and memory bandwidth pressure
>   observed in V3.
>
> Why the changes in v4 vs v3?
>
> I had previously identified that the last remaining performance bottleneck
> was the 4 KB memcpy performed every time journal_dirty_block is called, and
> I was thinking about how to improve it. A deferred copy comes to mind,
> but...
>
> The Hurd VFS locks at the node level rather than the physical block level
> (as Linux does). Because multiple nodes may share the same 4 KB disk block,
> naively deferring the journal copy until commit time can capture torn
> writes if another thread is actively modifying a neighboring node in the
> same block.
>
> Precisely because of this, V3 performed a 4 KB memcpy immediately inside
> journal_dirty_block() (copy-on-write) while the node lock was held. While
> safe, this placed expensive memory operations and global journal lock
> contention directly in the VFS hot-path, causing severe slowdowns under
> heavy parallel workloads.
>
> V4 removes this eager copy entirely by leveraging an existing transaction
> invariant: all VFS threads increment and decrement the active transaction's
> `t_updates` counter via the start/stop transaction functions. A transaction
> cannot commit until this counter reaches zero. When `t_updates == 0`, we
> are guaranteed that no VFS threads are mutating blocks belonging to the
> transaction. At that exact moment, the memory backing those blocks has
> fully settled and can be safely copied without risk of torn writes. A
> perfect place for a deferred write!
>
> journal_dirty_block() now simply records the dirty block id in a hash
> table, making the hot-path strictly O(1).
> (This is why we see such an amazing performance boost between v3 and v4.)
>
> But we also need to avoid redundant copies: because transactions remain
> open for several seconds, `t_updates` may bounce to zero and back up many
> times during a heavy workload (as multiple VFS threads start/stop the
> transaction). To avoid repeatedly copying the same unchanged blocks every
> time the counter hits zero, each shadow buffer now contains a `needs_copy`
> flag.
>
> When a block is dirtied, the flag is set. When `t_updates` reaches zero,
> only buffers with `needs_copy == 1` are copied to the shadow buffers, after
> which the flag is cleared. So two things need to be true for a block to be
> copied: 1) `t_updates` must have just hit 0, and 2) `needs_copy` must be 1.
>
> This architecture completely removes the hot-path bottleneck. Journaled
> ext2fs now achieves performance virtually identical to vanilla ext2fs, even
> under brutal concurrency (e.g. scripts doing heavy writes from multiple
> shells at the same time).
>
> I know this is a dense patch with a lot to unpack. I've documented the
> locking and Mach VM interactions as thoroughly as possible in the code
> itself (roughly 1/3 of the lines in ext2fs/journal.c are comments), but I
> understand there is only so much nuance that can fit into C comments. If it
> would be helpful, I would be happy to draft a dedicated document detailing
> the journal's lifecycle, its hooks into libdiskfs/ext2, and the rationale
> behind the macro-transaction design, so future developers have a clear
> reference.
>
> Looking forward to your thoughts.
>
> Best,
> Milos
>
> On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]> wrote:
>
> Hi Samuel,
>
> Just a quick heads-up: please hold off on reviewing this V3 series.
> While V3 works fast in simple single-threaded scenarios (like configure or
> make ext2fs), I found that under heavy multi-threaded stress tests a
> significant performance degradation happens, due to a lock contention
> bottleneck caused by the eager VFS memcpy hot-path. (The memcpy inside
> journal_dirty_block, which is called thousands of times a second, really
> becomes a performance problem.)
>
> I have been working on a much cleaner approach that safely defers the block
> copying to the quiescent transaction stop state. It completely eliminates
> the VFS lock contention and brings the journaled performance back to
> vanilla ext2fs levels, even with many threads competing at
> writing/reading/renaming in the same place.
>
> I am going to test this new architecture thoroughly over the next few days
> and will send it as V4 once I am certain it is rock solid.
>
> Thanks!
>
> On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]> wrote:
>
> Hello Samuel and the Hurd team,
>
> I am sending over v3 of the journaling patch. I know v2 is still pending
> review, but while testing and profiling based on previous feedback, I
> realized the standard mapping wasn't scaling well for metadata-heavy
> workloads. I wanted to send this updated architecture your way to save you
> from spending time reviewing the obsolete v2 code.
>
> This version keeps the core JBD2 logic from v2 but introduces several
> structural optimizations, bug fixes, and code cleanups:
> - Robin Hood hash map: replaced ihash with a custom map for significantly
>   tighter cache locality and faster lookups.
> - O(1) slab allocator: added a pre-allocated pool to make transaction
>   buffers zero-allocation in the hot path.
> - Unified buffer tracking: eliminated the dual linked-list/map structure in
>   favor of just the map, fixing a synchronization bug from v2 and
>   simplifying the code.
> - A few other small bug fixes.
> - Refactored dirty block hooks: moved the journal_dirty_block calls from
>   inode.c directly into the ext2fs.h low-level block computation functions
>   (record_global_poke, sync_global_ptr, record_indir_poke, and alloc_sync).
>   This feels like a more natural fit and makes it much easier to ensure we
>   aren't missing any call sites.
>
> Performance benchmarks:
> I ran repeated tests on my machine to measure the overhead, comparing this
> v3 journal implementation against vanilla Hurd.
>
> make ext2fs (CPU/data bound - 5 runs):
>   Vanilla Hurd average: ~2m 40.6s
>   Journal v3 average:   ~2m 41.3s
>   Result: statistical tie. Journal overhead is practically zero.
>
> make clean && ../configure (metadata bound - 5 runs):
>   Vanilla Hurd average: ~3.90s (with latency spikes up to 4.29s)
>   Journal v3 average:   ~3.72s (rock-solid consistency, never breaking 3.9s)
>   Result: journaled ext2 is actually faster and more predictable here due
>   to the WAL absorbing random I/O.
>
> Crash consistency proof:
> Beyond performance, I wanted to demonstrate the actual crash recovery in
> action.
> 1. Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
> 2. Wait for the 5-second kjournald commit tick.
> 3. Hard crash the machine (kill -9 the QEMU process on the host).
>
> Inspecting from the Linux host before recovery shows the inode is
> completely busted (as expected):
>
> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>
> debugfs 1.47.3 (8-Jul-2025)
> Inode: 373911   Type: bad type   Mode: 0000   Flags: 0x0
> Generation: 0   Version: 0x00000000
> User: 0   Group: 0   Size: 0
> File ACL: 0   Translator: 0
> Links: 0   Blockcount: 0
> Fragment: Address: 0   Number: 0   Size: 0
> ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
> BLOCKS:
>
> Note: On vanilla Hurd, running fsck here would permanently lose the
> directory or potentially cause further damage, depending on luck.
> Triggering the journal replay:
>
> sudo e2fsck -fy /dev/nbd0
>
> Inspecting immediately after recovery:
>
> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>
> debugfs 1.47.3 (8-Jul-2025)
> Inode: 373911   Type: directory   Mode: 0775   Flags: 0x0
> Generation: 1773077012   Version: 0x00000000
> User: 1001   Group: 1001   Size: 4096
> File ACL: 0   Translator: 0
> Links: 2   Blockcount: 8
> Fragment: Address: 0   Number: 0   Size: 0
> ctime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
> atime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
> mtime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
> BLOCKS:
> (0):1507738
> TOTAL: 1
>
> The journal successfully reconstructed the directory, and logdump confirms
> the transactions were consumed perfectly.
>
> I have run similar hard-crash tests for rename, chmod, chown, etc., with
> the same successful recovery results.
>
> I've attached the v3 diff. Let me know what you think of the new hash map
> and slab allocator approach!
>
> Best,
> Milos
>
> On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]> wrote:
>
> And here is the last one...
>
> I hacked up an improvement for journal_dirty_block to try to see if I could
> speed it up a bit:
> 1) Used a specialized Robin Hood based hash table for speed (no tombstones,
>    etc.). (I took it from one of my personal projects... just specialized
>    it here a bit.)
> 2) Used a small slab allocator to avoid malloc-ing in the hot path.
> 3) Liberally sprinkled __rdtsc() to get a sense of cycle time inside
>    journal_dirty_block.
>
> Got to say, just this simple local change managed to shave off 3-5% of the
> slowness.
>
> So my test is:
> - Boot Hurd.
> - Inside Hurd, go to the Hurd build directory.
> - Run:
>   $ make clean && ../configure
>   $ time make ext2fs
>
> I do it multiple times for 3 different versions of the ext2 libraries:
> 1) Vanilla Hurd (no journal): ~avg 151 seconds
> 2) Enhanced JBD2 (slab + custom hash): ~159 seconds (5% slower!)
> 3) Baseline JBD2 (malloc + libihash, what was sent in V2): ~168 seconds
>
> Of course there is a lot of variability, and my laptop is not a perfect
> environment for these kinds of benchmarks, but this is what I have.
>
> My printouts on the screen show this:
>
> ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
> ext2fs: part:5:device:wd0: warning: Total Dirty Calls: 339105
> ext2fs: part:5:device:wd0: warning: Total Function:  217101909 cycles
> ext2fs: part:5:device:wd0: warning: Total Lock Wait: 16741691 cycles
> ext2fs: part:5:device:wd0: warning: Total Alloc:     673363 cycles
> ext2fs: part:5:device:wd0: warning: Total Memcpy:    137938008 cycles
> ext2fs: part:5:device:wd0: warning: Total Hash Add:  258533 cycles
> ext2fs: part:5:device:wd0: warning: Total Hash Find: 29501960 cycles
> ext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call) ---
> ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
> ext2fs: part:5:device:wd0: warning: Avg Lock Wait: 49 cycles
> ext2fs: part:5:device:wd0: warning: Avg Memcpy:    406 cycles
> ext2fs: part:5:device:wd0: warning: Avg Malloc:    1 cycles
> ext2fs: part:5:device:wd0: warning: Avg Hash Add:  0 cycles
> ext2fs: part:5:device:wd0: warning: Avg Hash Find: 86 cycles
> ext2fs: part:5:device:wd0: warning: ==================
>
> The averages here say a lot... with these improvements we are now down to
> basically memcpy time, and for copying 4096 bytes of RAM I'm not sure we
> can make it take less than ~400 cycles, so we are hitting hardware
> limitations. It would be great if we could avoid the memcpy here altogether,
> or delay it until commit or similar. I have some ideas, but they all require
> drastic changes across libdiskfs and ext2fs, and I'm not sure a few
> remaining percentage points of slowdown warrant that.
> Also, wow: during the ext2 compilation, this function (journal_dirty_block)
> is being called a bit more than 1000 times per second (for each and every
> block that is ever touched by the compiler).
>
> I am attaching the altered journal.c with these changes, in case anyone is
> interested in seeing the localized changes.
>
> Regards,
> Milos
>
> On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]> wrote:
>
> Hi Samuel,
>
> One quick detail I forgot to mention regarding the performance analysis:
> the entire ~0.4s performance impact I measured is isolated exclusively to
> journal_dirty_block.
>
> To verify this, I ran an experiment where I stubbed out journal_dirty_block
> so it just returned immediately (which obviously makes for a very fast, but
> not very useful, journal!). With that single function bypassed, the
> filesystem performs identically to vanilla Hurd.
>
> This confirms that the background kjournald flusher, the transaction
> reference counting, and the checkpointing logic add absolutely no
> noticeable latency to the VFS. The overhead is strictly tied to the physics
> of the memory copying and hash map lookups in that one function, which we
> can improve in subsequent patches.
>
> Thanks, Milos
>
> On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]> wrote:
>
> Hi Samuel,
>
> Thanks for reviewing my mental model on V1; I appreciate the detailed
> feedback.
>
> Attached is the v2 patch. Here is a breakdown of the architectural changes
> and refactors based on your review:
>
> 1. diskfs_node_update and the pager
> Regarding the question, "Do we really want to update the node?": yes, we
> must update it with every change. JBD2 works strictly at the physical block
> level, not the abstract node cache level. To capture a node change in the
> journal, the block content must be physically serialized to the transaction
> buffer. Currently, this path is diskfs_node_update -> diskfs_write_disknode
> -> journal_dirty_block.
> When wait is 0, this just copies the node details from the node cache to
> the pager. It is strictly an in-memory serialization and is extremely fast.
> I have updated the documentation for diskfs_node_update to explicitly
> describe this behavior, so future maintainers understand it isn't
> triggering synchronous disk I/O and doesn't measurably increase the latency
> of the file system. journal_dirty_block is now one of the most hammered
> functions in libdiskfs/ext2; more on that below.
>
> 2. Synchronous wait & factorization
> I completely agree with your factorization advice:
> write_disknode_journaled has been folded directly into
> diskfs_write_disknode, making it much cleaner.
> Regarding the wait flag: we are no longer ignoring it! Instead of blocking
> the VFS deep in the stack, we now set an "IOU" flag on the transaction.
> This bubbles the sync requirement up to the outer RPC layer, which is the
> only place safe enough to actually sleep on the commit and thus maintain
> the POSIX sync requirement without deadlocking.
>
> 3. Multiple writes to the same metadata block
> "Can it happen that we write several times to the same metadata block?"
> Yes, multiple nodes can live in the same block. However, because the Mach
> pager always flushes the latest snapshot of the block, we don't have an
> issue with mixed or stale data hitting the disk.
> If RPCs hit while the pager is actively writing, that is all captured in
> the running transaction. If the running transaction happens to contain the
> same blocks the pager is committing, the running transaction will be
> forcibly committed.
>
> 4. The new libdiskfs API
> I added two new opaque accessors to diskfs.h:
>
> diskfs_journal_set_sync
> diskfs_journal_needs_sync
>
> This allows inner nested functions to declare a strict need for a POSIX
> sync without causing lock inversions. We only commit at the top RPC layer,
> once the operation is fully complete and the locks are dropped.
>
> 5.
Cleanups & Ordering
> - Removed the redundant record_global_poke calls.
> - Reordered the pager write notification in journal.c to sit after the
>   committing function, as the pager write happens after the journal commit.
> - Merged the ext2_journal checks inside diskfs_journal_start_transaction to
>   return early.
> - Reverted the bold unlock moves.
> - Fixed the information leaks.
> - Elevated the deadlock/WAL-bypass logs to ext2_warning.
>
> Performance:
> I investigated the ~0.4s regression (an increase from 4.9s to 5.3s) on my
> SSD during a heavy Hurd ../configure test. By stubbing out
> journal_dirty_block, performance returned to vanilla Hurd speeds, isolating
> the overhead to that specific function.
>
> A nanosecond profile reveals the cost is evenly split across the mandatory
> physics of a block journal:
>
> 25%: lock contention (global transaction serialization)
> 22%: memcpy (shadowing the 4 KB blocks)
> 21%: hash find (hurd_ihash lookups for block deduplication)
>
> I was surprised to see hurd_ihash taking up nearly a quarter of the
> overhead. I added some collision mitigation, but left further improvements
> out of this patch to keep the scope tight. In the future, we could drop the
> malloc entirely using a slab allocator and optimize the hash map to get
> this overhead closer to zero (along with introducing a "frozen data"
> concept like Linux does, but that would be a bigger, non-localized change).
> Final note on lock hierarchy
> The intended, deadlock-free use of the journal in libdiskfs is best
> illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h:
>
>   txn = diskfs_journal_start_transaction ();
>   pthread_mutex_lock (&np->lock);
>   (OPERATION);
>   diskfs_node_update (np, diskfs_synchronous);
>   pthread_mutex_unlock (&np->lock);
>   if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
>     diskfs_journal_commit_transaction (txn);
>   else
>     diskfs_journal_stop_transaction (txn);
>
> By keeping journal operations strictly outside of the node
> locking/unlocking phases, we treat the journal as the outermost "lock" on
> the file system, preventing deadlocks by construction.
>
> Kind regards,
> Milos
>
> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <[email protected]> wrote:
>
> Hello,
>
> Milos Nikic, on Thu, Mar 5, 2026 at 09:31:26 -0800, wrote:
> > The Hurd VFS works in 3 layers:
> >
> > 1. Node cache layer: the abstract node lives here, and it is the ground
> > truth of a running file system. When one does a stat myfile.txt, we get
> > the information straight from the cache. When we create a new file, it
> > gets placed in the cache, etc.
> >
> > 2. Pager layer: this is where nodes are serialized into the actual
> > physical representation (4 KB blocks) that will later be written to disk.
> >
> > 3. Hard drive: the physical storage that receives the bytes from the
> > pager.
> >
> > During normal operations (not a sync mount, fsync, etc.), the VFS
> > operates almost entirely on layer 1, the node cache layer. This is why
> > it's super fast. User changed atime? No problem. It just fetches a node
> > from the node cache (hash table lookup, amortized O(1)) and updates the
> > struct in memory. And that is it.
>
> Yes, so that we get as efficient as possible.
>
> > Only when the sync interval hits (every 30 seconds by default) does the
> > node cache get iterated and serialized to the pager layer
> > (diskfs_sync_everything -> write_all_disknodes -> write_node ->
> > pager_sync).
> > So basically, at that moment, we create a snapshot of the state of the
> > node cache and place it onto the pager(s).
>
> It's not exactly a snapshot, because the coherency between inodes and data
> is not completely enforced (we write all disknodes before asking the kernel
> to write back dirty pages, and then poke the writes).
>
> > Even then, pager_sync is called with wait = 0. It is handed to the pager,
> > which sends it to Mach. At some later time (seconds or so later), Mach
> > sends it back to the ext2 pager, which finally issues store_write to
> > write it to layer 3 (the hard drive). And even that depends on how the
> > driver reorders or delays it.
> >
> > The effect of this architecture is that when store_write is finally
> > called, the absolute latest version of the node cache snapshot is what
> > gets written to the storage. Is this basically correct?
>
> It seems to be so indeed.
>
> > Are there any edge cases or mechanics that are wrong in this model that
> > would make us receive a "stale" node cache snapshot?
>
> Well, it can be "stale" if another RPC hasn't called diskfs_node_update()
> yet, but that's what "safe" filesystems are all about: they don't actually
> provide more than coherency of the content on the disk, so fsck is not
> supposed to be needed. Then, if a program really wants coherency between
> some files etc., it has to issue sync calls; dpkg does it, for instance.
>
> Samuel
