On 19 March 2026 at 18:06:17 GMT+02:00, Milos Nikic <[email protected]> wrote:
>Hello,
>
>Thanks for trying to build this.
>
>Yeah, I think hurd-0.9.git20251029 is a bit dated and doesn't contain
>all the necessary code in libstore and the other libraries (the
>store_sync function, etc.), so it won't compile/work.
>
>For now it needs to be built from the Hurd git repository's master
>branch, if that is an option.
>
>I could try to create a separate giant patch that encompasses all the
>changes since hurd-0.9.git20251029, but I'm not sure that is the right
>way.
No need for that extra work. I will try master.
>
>
>Regards,
>Milos
>
>On Thu, Mar 19, 2026, 7:07 AM gfleury <[email protected]> wrote:
>
>> Hi,
>>
>> I'm attempting to build Hurd with the JBD2 journaling patch on AMD64, but
>> I'm encountering a compilation error that I need help resolving.
>>
>> Here's what I'm doing:
>>
>> 1. apt update
>> 2. apt source hurd
>> 3. cd hurd-0.9.git20251029
>> 4. patch -p1 <
>> ../v5-0001-ext2fs-Add-JBD2-journaling-to-ext2-libdiskfs.patch
>> 5. sudo dpkg-buildpackage -us -uc -b
>>
>> The build fails with this error:
>>
>> ```
>> ../../ext2fs/journal.c: In function 'flush_to_disk':
>> ../../ext2fs/journal.c:592:17: error: implicit declaration of function
>> 'store_sync' [-Wimplicit-function-declaration]
>> 592 | error_t err = store_sync (store);
>> | ^~~~~~~~~~
>> ```
>>
>> Interestingly, the store_sync function doesn't seem to exist in
>> hurd-0.9.git20251029, but it is present in the current master branch. Could
>> this be a version compatibility issue with the patch?
>>
>> Any guidance would be appreciated. Thanks!
>>
>> On 2026-03-18 01:04, Milos Nikic wrote:
>>
>> Hello again Samuel,
>>
>> First of all, I want to apologize again for the patch churn over the
>> past week. I wanted to put this to rest properly, and I am now sending
>> my final, stable version.
>>
>> This is it. I have applied numerous fixes, performance tweaks, and
>> cleanups. I am happy to report that this now performs on par with
>> unjournaled ext2 on normal workloads, such as configuring/compiling
>> the Hurd, installing and reinstalling packages via APT, and untarring
>> large archives (like the Linux kernel). I have also heavily tested it
>> against artificial stress conditions (which I am happy to share if
>> there is interest), and it handles highly concurrent loads beautifully
>> without deadlocks or memory leaks.
>>
>> Progressive checkpointing ensures the filesystem runs smoothly, and
>> the feature remains strictly opt-in (until a partition is tuned with
>> tune2fs -j, the journal is completely inactive).
>>
>> The new API in libdiskfs is minimal but expressive enough to wrap all
>> filesystem operations in transactions and handle strict POSIX sync
>> barriers.
>>
>> Since v4, I have made several major architectural improvements:
>>
>> Smart Auto-Commit: diskfs_journal_stop_transaction now automatically
>> commits to disk if needs_sync has been flagged anywhere in the nested
>> RPC chain and the reference count drops to zero.
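[Editor's note: as a rough illustration of the auto-commit rule described above, here is a self-contained model in C. The names (txn_t, txn_stop, and so on) are hypothetical stand-ins for sketching purposes, not the actual libdiskfs API.]

```c
/* Simplified model of smart auto-commit: the commit happens
 * automatically once the last user stops the transaction AND a sync
 * was requested somewhere in the nested RPC chain.  */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
  int t_updates;     /* active VFS users of this transaction */
  bool needs_sync;   /* flagged anywhere in the nested RPC chain */
  int commits;       /* demonstration counter */
} txn_t;

static void
txn_start (txn_t *txn)
{
  txn->t_updates++;
}

/* Declare a strict POSIX sync requirement for this transaction.  */
static void
txn_set_sync (txn_t *txn)
{
  txn->needs_sync = true;
}

/* Stop: only the outermost stop (refcount reaching zero) with a
 * pending sync request actually triggers the commit to disk.  */
static void
txn_stop (txn_t *txn)
{
  if (--txn->t_updates == 0 && txn->needs_sync)
    {
      txn->commits++;
      txn->needs_sync = false;
    }
}
```

Nested RPCs simply call txn_set_sync and txn_stop; the inner stops are no-ops with respect to committing, which is what keeps sleeping on the commit safely confined to the outer RPC layer.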
>>
>> Cleaned ext2 internal Journal API: I have exposed journal_store_write
>> and journal_store_read as block-device filter layers. Internal state
>> checks (journal_has_active_transaction, etc.) are now strictly hidden.
>> How the journal preserves the WAL property is now very obvious, as it
>> directly intercepts physical store operations.
>>
>> The "Lifeboat" Cache: Those store wrappers now utilize a small,
>> temporary internal cache to handle situations where the Mach VM pager
>> rushes blocks due to memory pressure. The Lifeboat seamlessly
>> intercepts and absorbs these hazard blocks without blocking the pager
>> or emitting warnings, even at peak write throughput.
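[Editor's note: a minimal sketch of the "lifeboat" idea, assuming the store-write wrapper can detect a commit in flight. All names here are hypothetical, not the patch's actual identifiers.]

```c
/* A tiny fixed-size cache that absorbs blocks the Mach pager rushes
 * out while those same blocks are still being committed, so the pager
 * is never blocked and the WAL property is preserved.  */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define LIFEBOAT_SLOTS 8

struct lifeboat_slot
{
  bool used;
  unsigned long block;
  char data[BLOCK_SIZE];
};

static struct lifeboat_slot lifeboat[LIFEBOAT_SLOTS];
static bool commit_in_progress;

/* Called by the store-write filter: absorb a hazard block instead of
 * letting it race the journal commit.  Returns true if absorbed.  */
static bool
lifeboat_absorb (unsigned long block, const char *data)
{
  int i, free_slot = -1;
  if (!commit_in_progress)
    return false;               /* no hazard: write through normally */
  for (i = 0; i < LIFEBOAT_SLOTS; i++)
    {
      if (lifeboat[i].used && lifeboat[i].block == block)
        {
          memcpy (lifeboat[i].data, data, BLOCK_SIZE);  /* refresh */
          return true;
        }
      if (!lifeboat[i].used && free_slot < 0)
        free_slot = i;
    }
  if (free_slot < 0)
    return false;               /* lifeboat full: caller must wait */
  lifeboat[free_slot].used = true;
  lifeboat[free_slot].block = block;
  memcpy (lifeboat[free_slot].data, data, BLOCK_SIZE);
  return true;
}
```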
>>
>> As before, I have added detailed comments across the patch to explain
>> the state machine and locking hierarchy. I know this is a complex
>> subsystem, so I am more than happy to write additional documentation
>> in whatever form is needed.
>>
>> Once again, apologies for the rapid iterations. I won't be touching
>> this code further until I hear your feedback.
>>
>> Kind regards,
>> Milos
>>
>> On Sun, Mar 15, 2026 at 9:01 PM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hi Samuel,
>>
>> I am writing to sincerely apologize for the insane amount of patch churn
>> over the last week. I know the rapid version bumps from v2 up to v4 have
>> been incredibly noisy, and I want to hit the brakes before you spend any
>> more time reviewing the current code.
>>
>> While running some extreme stress tests on a very small ext2 partition
>> with the tiniest journal allowed by the tooling, I caught a few critical
>> edge cases. While fixing those, I also realized that my libdiskfs VFS API
>> boundary is clunkier than it needs to be. I am currently rewriting it to
>> more closely match Linux's JBD2 semantics, where the VFS simply flags a
>> transaction for sync and calls stop, allowing the journal to auto-commit
>> when the reference count drops to zero.
>>
>> I'm also adding handling for cases where the Mach VM pager rushes blocks
>> to the disk while they are in the process of committing. This safely
>> intercepts them and will remove those warnings and WAL violations in almost
>> all cases.
>>
>> Please completely disregard v4.
>>
>> I promise the churn is coming to an end. I am going to take a little time
>> to finish this API contraction, stress-test it, polish it, and make sure it
>> is 100% rock-solid. I will be back soon with a finalized v5.
>>
>> Thanks for your patience with my crazy iteration process!
>>
>> Best, Milos
>>
>>
>>
>> On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hi Samuel,
>>
>> As promised, here is the thoroughly tested and benchmarked V4 revision of
>> the JBD2 Journal for Hurd.
>>
>> This revision addresses a major performance bottleneck present in V3 under
>> heavy concurrent workloads. The new design restores performance to match
>> vanilla Hurd's unjournaled ext2fs while preserving full crash consistency.
>>
>> Changes since V3:
>> - Removed eager memcpy() from the journal_dirty_block() hot-path.
>> - Introduced deferred block copying that triggers only when the
>> transaction becomes quiescent.
>> - Added a `needs_copy` flag to prevent redundant memory copies.
>> - Eliminated the severe lock contention and memory bandwidth pressure
>> observed in V3.
>>
>> Why the changes in v4 vs v3?
>> I had previously identified that the last remaining performance
>> bottleneck was the 4 KB memcpy performed every time journal_dirty_block
>> is called, and I was thinking about how to improve it.
>> A deferred copy comes to mind, but...
>> The Hurd VFS locks at the node level rather than the physical block level
>> (as Linux does). Because multiple nodes may share the same 4KB disk block,
>> naively deferring the journal copy until commit time can capture torn
>> writes if another thread is actively modifying a neighboring node in the
>> same block.
>>
>> Precisely because of this, V3 performed a 4KB memcpy immediately inside
>> journal_dirty_block() (copy-on-write) while the node lock was held. While
>> safe, this placed expensive memory operations and global journal lock
>> contention directly in the VFS hot-path, causing severe slowdowns under
>> heavy parallel workloads.
>>
>> V4 removes this eager copy entirely by leveraging an existing transaction
>> invariant:
>> All VFS threads increment and decrement the active transaction’s
>> `t_updates` counter via the start/stop transaction functions. A transaction
>> cannot commit until this counter reaches zero.
>> When `t_updates == 0`, we are mathematically guaranteed that no VFS
>> threads are mutating blocks belonging to the transaction. At that exact
>> moment, the memory backing those blocks has fully settled and can be safely
>> copied without risk of torn writes. A perfect place for a deferred write!
>>
>> journal_dirty_block() now simply records the dirty block id in a hash
>> table, making the hot-path strictly O(1). (and this is why we have an
>> amazing performance boost between v3 and v4)
>>
>> But we also need to avoid redundant copies:
>> Because transactions remain open for several seconds, `t_updates` may
>> bounce to zero and back up many times during a heavy workload (as multiple
>> VFS threads start/stop the transaction). To avoid repeatedly copying the
>> same unchanged blocks every time the counter hits zero, each shadow buffer
>> now contains a `needs_copy` flag.
>>
>> When a block is dirtied, the flag is set. When `t_updates` reaches zero,
>> only buffers with `needs_copy == 1` are copied to the shadow buffers, after
>> which the flag is cleared.
>> So two things need to be true for a block to be copied: 1) t_updates
>> must have just hit 0, and 2) needs_copy must be 1.
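[Editor's note: the deferred-copy mechanism described above can be sketched as a self-contained model in C. The structures and names below are illustrative simplifications, not the patch's own code.]

```c
/* Model of the deferred shadow copy: the hot path only flags a block;
 * the actual memcpy happens when the transaction is quiescent
 * (t_updates just reached zero), and only for flagged blocks.  */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct jbuf
{
  char *live;                 /* memory the VFS mutates */
  char shadow[BLOCK_SIZE];    /* journal's stable copy */
  bool needs_copy;
  int copies;                 /* demonstration counter */
};

/* Hot path: strictly O(1), just record that the block is dirty.  */
static void
model_dirty_block (struct jbuf *b)
{
  b->needs_copy = true;
}

/* Called when t_updates drops to zero: no VFS thread is mutating any
 * block of the transaction, so snapshots cannot capture torn writes.
 * Blocks whose flag is already clear are skipped, avoiding redundant
 * copies as the counter bounces to zero and back up.  */
static void
model_quiesce (struct jbuf *bufs, int n)
{
  for (int i = 0; i < n; i++)
    if (bufs[i].needs_copy)
      {
        memcpy (bufs[i].shadow, bufs[i].live, BLOCK_SIZE);
        bufs[i].needs_copy = false;
        bufs[i].copies++;
      }
}
```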
>>
>> This architecture completely removes the hot-path bottleneck. Journaled
>> ext2fs now achieves performance virtually identical to vanilla ext2fs, even
>> under brutal concurrency (e.g. scripts doing heavy writes from multiple
>> shells at the same time).
>>
>> I know this is a dense patch with a lot to unpack. I've documented the
>> locking and Mach VM interactions as thoroughly as possible in the code
>> itself (roughly 1/3 of the lines are comments in ext2fs/journal.c), but I
>> understand there is only so much nuance that can fit into C comments.
>> If it would be helpful, I would be happy to draft a dedicated document
>> detailing the journal's lifecycle, its hooks into libdiskfs/ext2, and the
>> rationale behind the macro-transaction design, so future developers have a
>> clear reference.
>>
>> Looking forward to your thoughts.
>>
>> Best,
>> Milos
>>
>>
>>
>>
>>
>> On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hi Samuel,
>>
>> Just a quick heads-up: please hold off on reviewing this V3 series.
>>
>> While V3 works fast in simple single-threaded scenarios (like configure
>> or make ext2fs), I found while running some heavy multi-threaded stress
>> tests that a significant performance degradation happens due to a lock
>> contention bottleneck caused by the eager VFS memcpy hot-path. (The
>> memcpy inside journal_dirty_block, which is called thousands of times a
>> second, really becomes a performance problem.)
>>
>> I have been working on a much cleaner approach that safely defers the
>> block copying to the quiescent transaction stop state. It completely
>> eliminates the VFS lock contention and brings the journaled performance
>> back to vanilla ext2fs levels even with many threads competing at
>> writing/reading/renaming in the same place.
>>
>> I am going to test this new architecture thoroughly over the next few days
>> and will send it as V4 once I am certain it is rock solid.
>>
>> Thanks!
>>
>>
>>
>> On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hello Samuel and the Hurd team,
>>
>> I am sending over v3 of the journaling patch. I know v2 is still pending
>> review, but while testing and profiling based on previous feedback, I
>> realized the standard mapping wasn't scaling well for metadata-heavy
>> workloads. I wanted to send this updated architecture your way to save you
>> from spending time reviewing the obsolete v2 code.
>>
>> This version keeps the core JBD2 logic from v2 but introduces several
>> structural optimizations, bug fixes, and code cleanups:
>> - Robin Hood Hash Map: Replaced ihash with a custom map for
>> significantly tighter cache locality and faster lookups.
>> - O(1) Slab Allocator: Added a pre-allocated pool to make transaction
>> buffers zero-allocation in the hot path.
>> - Unified Buffer Tracking: Eliminated the dual linked-list/map
>> structure in favor of just the map, fixing a synchronization bug from v2
>> and simplifying the code.
>>
>> - Few other small bug fixes
>> - Refactored Dirty Block Hooks: Moved the journal_dirty_block calls
>> from inode.c directly into the ext2fs.h low-level block computation
>> functions (record_global_poke, sync_global_ptr, record_indir_poke, and
>> alloc_sync). This feels like a more natural fit and makes it much easier to
>> ensure we aren't missing any call sites.
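[Editor's note: a minimal Robin Hood open-addressing map in the spirit of the custom map described above; this is an editorial sketch, not the patch's implementation. The trick is that lookups can stop as soon as a slot's probe distance is shorter than the key's, so no tombstones are needed.]

```c
/* Fixed-capacity Robin Hood hash map: block number -> pointer.
 * rh_init must be called before use (key 0 is a valid key, so slots
 * are marked empty with a sentinel, not zero-initialization).  */
#include <assert.h>
#include <stddef.h>

#define CAP 64                          /* power of two, fixed here */
#define EMPTY ((unsigned long) -1)      /* sentinel key */

struct slot { unsigned long key; void *val; };
static struct slot table[CAP];

static size_t
hash (unsigned long k)
{
  return (k * 2654435761u) & (CAP - 1);
}

static void
rh_init (void)
{
  for (size_t i = 0; i < CAP; i++)
    table[i].key = EMPTY;
}

static void
rh_insert (unsigned long key, void *val)
{
  size_t i = hash (key), dist = 0;
  struct slot cur = { key, val };
  for (;;)
    {
      if (table[i].key == EMPTY || table[i].key == cur.key)
        {
          table[i] = cur;
          return;
        }
      /* Robin Hood: steal the slot from a "richer" entry (one sitting
         closer to its home bucket) and keep probing with the evictee. */
      size_t their = (i - hash (table[i].key)) & (CAP - 1);
      if (their < dist)
        {
          struct slot tmp = table[i];
          table[i] = cur;
          cur = tmp;
          dist = their;
        }
      i = (i + 1) & (CAP - 1);
      dist++;
    }
}

static void *
rh_find (unsigned long key)
{
  size_t i = hash (key), dist = 0;
  while (table[i].key != EMPTY)
    {
      if (table[i].key == key)
        return table[i].val;
      size_t their = (i - hash (table[i].key)) & (CAP - 1);
      if (their < dist)
        return NULL;    /* our key would have been placed earlier */
      i = (i + 1) & (CAP - 1);
      dist++;
    }
  return NULL;
}
```

Keeping probe sequences short and cache-dense like this is what gives Robin Hood maps their lookup advantage over chained tables such as ihash.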
>>
>> Performance Benchmarks:
>> I ran repeated tests on my machine to measure the overhead, comparing this
>> v3 journal implementation against Vanilla Hurd.
>> make ext2fs (CPU/Data bound - 5 runs):
>> Vanilla Hurd Average: ~2m 40.6s
>> Journal v3 Average: ~2m 41.3s
>> Result: Statistical tie. Journal overhead is practically zero.
>>
>> make clean && ../configure (Metadata bound - 5 runs):
>> Vanilla Hurd Average: ~3.90s (with latency spikes up to 4.29s)
>> Journal v3 Average: ~3.72s (rock-solid consistency, never breaking
>> 3.9s)
>> Result: Journaled ext2 is actually faster and more predictable here
>> due to the WAL absorbing random I/O.
>>
>> Crash Consistency Proof:
>> Beyond performance, I wanted to demonstrate the actual crash recovery in
>> action.
>> Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
>> Wait for the 5-second kjournald commit tick.
>> Hard crash the machine (kill -9 the QEMU process on the host).
>>
>> Inspecting from the Linux host before recovery shows the inode is
>> completely busted (as expected):
>>
>> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>>
>> debugfs 1.47.3 (8-Jul-2025)
>> Inode: 373911 Type: bad type Mode: 0000 Flags: 0x0
>> Generation: 0 Version: 0x00000000
>> User: 0 Group: 0 Size: 0
>> File ACL: 0 Translator: 0
>> Links: 0 Blockcount: 0
>> Fragment: Address: 0 Number: 0 Size: 0
>> ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>> atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>> mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>> BLOCKS:
>>
>> Note: On Vanilla Hurd, running fsck here would permanently lose the
>> directory or potentially cause further damage depending on luck.
>>
>> Triggering the journal replay:
>> sudo e2fsck -fy /dev/nbd0
>>
>> Inspecting immediately after recovery:
>>
>> sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>>
>> debugfs 1.47.3 (8-Jul-2025)
>> Inode: 373911 Type: directory Mode: 0775 Flags: 0x0
>> Generation: 1773077012 Version: 0x00000000
>> User: 1001 Group: 1001 Size: 4096
>> File ACL: 0 Translator: 0
>> Links: 2 Blockcount: 8
>> Fragment: Address: 0 Number: 0 Size: 0
>> ctime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
>> atime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
>> mtime: 0x69af0213 -- Mon Mar 9 10:23:31 2026
>> BLOCKS:
>> (0):1507738
>> TOTAL: 1
>>
>> The journal successfully reconstructed the directory, and logdump confirms
>> the transactions were consumed perfectly.
>>
>> I have run similar hard-crash tests for rename, chmod, and chown etc with
>> the same successful recovery results.
>>
>> I've attached the v3 diff. Let me know what you think of the new hash map
>> and slab allocator approach!
>>
>> Best,
>> Milos
>>
>>
>>
>>
>>
>> On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> And here is the last one...
>>
>> I hacked up an improvement for journal_dirty_block to see if I could
>> speed it up a bit:
>> 1) Used a specialized Robin Hood hash table for speed (no tombstones,
>> etc.). (I took it from one of my personal projects and just specialized
>> it here a bit.)
>> 2) used a small slab allocator to avoid malloc-ing in the hot path
>> 3) liberally sprinkled __rdtsc() to get a sense of cycle time inside
>> journal_dirty_block
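[Editor's note: step 2 above, a small slab allocator, can be sketched as follows. This is a simplified self-contained model with illustrative names, not the code from the patch.]

```c
/* Minimal fixed-pool slab allocator: a preallocated array of
 * per-block descriptors chained on a free list, so the hot path
 * never calls malloc.  Both alloc and free are O(1).  */
#include <assert.h>
#include <stddef.h>

#define SLAB_OBJECTS 128

struct jh                       /* per-block journal descriptor */
{
  unsigned long block;
  struct jh *next_free;
};

static struct jh pool[SLAB_OBJECTS];
static struct jh *free_list;

static void
slab_init (void)
{
  free_list = NULL;
  for (int i = 0; i < SLAB_OBJECTS; i++)
    {
      pool[i].next_free = free_list;
      free_list = &pool[i];
    }
}

/* O(1): pop the head of the free list; NULL when the pool is
 * exhausted (the caller must then fall back or commit early).  */
static struct jh *
slab_alloc (void)
{
  struct jh *obj = free_list;
  if (obj)
    free_list = obj->next_free;
  return obj;
}

/* O(1): push the descriptor back onto the free list.  */
static void
slab_free (struct jh *obj)
{
  obj->next_free = free_list;
  free_list = obj;
}
```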
>>
>> Got to say, just this simple local change managed to shave off 3-5% of
>> slowness.
>>
>> So my test is:
>> - Boot Hurd
>> - Inside Hurd go to the Hurd build directory
>> - run:
>> $ make clean && ../configure
>> $ time make ext2fs
>>
>> I ran it multiple times for 3 different versions of the ext2 libraries:
>> 1) Vanilla Hurd (No Journal): ~avg, 151 seconds
>>
>> 2) Enhanced JBD2 (Slab + Custom Hash): ~159 seconds (5% slower!)
>>
>> 3) Baseline JBD2 (malloc + libihash, as sent in V2): ~168 seconds
>>
>> Of course there is a lot of variability, and my laptop is not a perfect
>> environment for these kinds of benchmarks, but this is what I have.
>>
>> My printouts on the screen show this:
>> ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
>> ext2fs: part:5:device:wd0: warning: Total Dirty Calls: 339105
>> ext2fs: part:5:device:wd0: warning: Total Function: 217101909
>> cycles
>> ext2fs: part:5:device:wd0: warning: Total Lock Wait: 16741691 cycles
>> ext2fs: part:5:device:wd0: warning: Total Alloc: 673363 cycles
>> ext2fs: part:5:device:wd0: warning: Total Memcpy: 137938008
>> cycles
>> ext2fs: part:5:device:wd0: warning: Total Hash Add: 258533 cycles
>> ext2fs: part:5:device:wd0: warning: Total Hash Find: 29501960 cycles
>> ext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call) ---
>> ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
>> ext2fs: part:5:device:wd0: warning: Avg Lock Wait: 49 cycles
>> ext2fs: part:5:device:wd0: warning: Avg Memcpy: 406 cycles
>> ext2fs: part:5:device:wd0: warning: Avg Malloc 1: 1 cycles
>> ext2fs: part:5:device:wd0: warning: Avg Hash Add: 0 cycles
>> ext2fs: part:5:device:wd0: warning: Avg Hash Find: 86 cycles
>> ext2fs: part:5:device:wd0: warning: ==================
>>
>> The averages here say a lot: with these improvements we are now down to
>> basically memcpy time, and for copying 4096 bytes of RAM I'm not sure we
>> can make it take less than 400 cycles, so we are hitting hardware
>> limitations.
>> It would be great if we could avoid the memcpy here altogether, or delay
>> it until commit or similar, and I have some ideas, but they all require
>> drastic changes across libdiskfs and ext2fs; I'm not sure a few
>> remaining percentage points of slowdown warrant that.
>>
>> Also, wow: during ext2 compilation, this function (journal_dirty_block)
>> is called a bit more than 1000 times per second (for each and every
>> block ever touched by the compiler).
>>
>> I am attaching the altered journal.c with these changes for anyone
>> interested in seeing the localized changes.
>>
>> Regards,
>> Milos
>>
>> On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hi Samuel,
>>
>> One quick detail I forgot to mention regarding the performance analysis:
>>
>> The entire ~0.4s performance impact I measured is isolated exclusively to
>> journal_dirty_block.
>>
>> To verify this, I ran an experiment where I stubbed out
>> journal_dirty_block so it just returned immediately (which obviously makes
>> for a very fast, but not very useful, journal!). With that single function
>> bypassed, the filesystem performs identically to vanilla Hurd.
>>
>> This confirms that the background kjournald flusher, the transaction
>> reference counting, and the checkpointing logic add absolutely no
>> noticeable latency to the VFS. The overhead is strictly tied to the physics
>> of the memory copying and hashmap lookups in that one block which we can
>> improve in subsequent patches.
>>
>> Thanks, Milos
>>
>>
>> On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]>
>> wrote:
>>
>>
>> Hi Samuel,
>>
>> Thanks for reviewing my mental model on V1; I appreciate the detailed
>> feedback.
>>
>> Attached is the v2 patch. Here is a breakdown of the architectural changes
>> and refactors based on your review:
>>
>> 1. diskfs_node_update and the Pager
>> Regarding the question, "Do we really want to update the node?": Yes, we
>> must update it with every change. JBD2 works strictly at the physical block
>> level, not the abstract node cache level. To capture a node change in the
>> journal, the block content must be physically serialized to the transaction
>> buffer. Currently, this path is diskfs_node_update -> diskfs_write_disknode
>> -> journal_dirty_block.
>> When wait is 0, this just copies the node details from the node-cache to
>> the pager. It is strictly an in-memory serialization and is extremely fast.
>> I have updated the documentation for diskfs_node_update to explicitly
>> describe this behavior so future maintainers understand it isn't triggering
>> synchronous disk I/O and doesn't measurably increase the latency of the
>> file system.
>> journal_dirty_block is now one of the most hammered functions in
>> libdiskfs/ext2; more on that below.
>>
>> 2. Synchronous Wait & Factorization
>> I completely agree with your factorization advice:
>> write_disknode_journaled has been folded directly into
>> diskfs_write_disknode, making it much cleaner.
>> Regarding the wait flag: we are no longer ignoring it! Instead of blocking
>> the VFS deeply in the stack, we now set an "IOU" flag on the transaction.
>> This bubbles the sync requirement up to the outer RPC layer, which is the
>> only place safe enough to actually sleep on the commit and thus maintain
>> the POSIX sync requirement without deadlocking etc.
>>
>> 3. Multiple Writes to the Same Metadata Block
>> "Can it happen that we write several times to the same metadata block?"
>> Yes, multiple nodes can live in the same block. However, because the Mach
>> pager always flushes the "latest snapshot" of the block, we don't have an
>> issue with mixed or stale data hitting the disk.
>> If RPCs hit while the pager is actively writing, that is all captured in
>> the RUNNING TRANSACTION. If that RUNNING TRANSACTION happens to contain
>> the same blocks the pager is committing, it will be forcibly committed.
>>
>> 4. The New libdiskfs API
>> I added two new opaque accessors to diskfs.h:
>>
>> diskfs_journal_set_sync
>> diskfs_journal_needs_sync
>>
>> This allows inner nested functions to declare a strict need for a
>> POSIX sync without causing lock inversions. We only commit at the top RPC
>> layer once the operation is fully complete and locks are dropped.
>>
>> 5. Cleanups & Ordering
>> Removed the redundant record_global_poke calls.
>> Reordered the pager write notification in journal.c to sit after the
>> committing function, as the pager write happens after the journal commit.
>> Merged the ext2_journal checks inside diskfs_journal_start_transaction
>> to return early.
>> Reverted the bold unlock moves.
>> Fixed the information leaks.
>> Elevated the deadlock/WAL bypass logs to ext2_warning.
>>
>> Performance:
>> I investigated the ~0.4s (increase from 4.9s to 5.3s) regression on my SSD
>> during a heavy Hurd ../configure test. By stubbing out journal_dirty_block,
>> performance returned to vanilla Hurd speeds, isolating the overhead to that
>> specific function.
>>
>> A nanosecond profile reveals the cost is evenly split across the mandatory
>> physics of a block journal:
>>
>> 25%: Lock Contention (Global transaction serialization)
>>
>> 22%: Memcpy (Shadowing the 4KB blocks)
>>
>> 21%: Hash Find (hurd_ihash lookups for block deduplication)
>>
>> I was surprised to see hurd_ihash taking up nearly a quarter of the
>> overhead. I added some collision mitigation, but left further
>> improvements out of this patch to keep the scope tight. In the future, we
>> could drop the malloc entirely using a slab allocator and optimize the
>> hashmap to get this overhead closer to zero (along with introducing a
>> "frozen data" concept like Linux does, but that would be a bigger,
>> non-localized change).
>>
>> Final Note on Lock Hierarchy
>> The intended, deadlock-free use of the journal in libdiskfs is best
>> illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h
>> txn = diskfs_journal_start_transaction ();
>> pthread_mutex_lock (&np->lock);
>> (OPERATION);
>> diskfs_node_update (np, diskfs_synchronous);
>> pthread_mutex_unlock (&np->lock);
>> if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
>> diskfs_journal_commit_transaction (txn);
>> else
>> diskfs_journal_stop_transaction (txn);
>>
>> By keeping journal operations strictly outside of the node
>> locking/unlocking phases, we treat it as the outermost "lock" on the file
>> system, mathematically preventing deadlocks.
>>
>> Kind regards,
>> Milos
>>
>>
>>
>> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <[email protected]>
>> wrote:
>>
>>
>> Hello,
>>
>> Milos Nikic, on Thu, 05 Mar 2026 09:31:26 -0800, wrote:
>>
>> Hurd VFS works in 3 layers:
>>
>> 1. Node cache layer: The abstract node lives here and it is the ground
>> truth
>> of a running file system. When one does a stat myfile.txt, we get the
>> information straight from the cache. When we create a new file, it gets
>> placed in the cache, etc.
>>
>> 2. Pager layer: This is where nodes are serialized into the actual
>> physical
>> representation (4KB blocks) that will later be written to disk.
>>
>> 3. Hard drive: The physical storage that receives the bytes from the
>> pager.
>>
>> During normal operations (not a sync mount, fsync, etc.), the VFS operates
>> almost entirely on Layer 1: The Node cache layer. This is why it's super
>> fast.
>> User changed atime? No problem. It just fetches a node from the node cache
>> (hash table lookup, amortized to O(1)) and updates the struct in memory.
>> And
>> that is it.
>>
>>
>> Yes, so that we get as efficient as possible.
>>
>> Only when the sync interval hits (every 30 seconds by default) does the
>> Node
>> cache get iterated and serialized to the pager layer
>> (diskfs_sync_everything ->
>> write_all_disknodes -> write_node -> pager_sync). So basically, at that
>> moment, we create a snapshot of the state of the node cache and place it
>> onto
>> the pager(s).
>>
>>
>> It's not exactly a snapshot because the coherency between inodes and
>> data is not completely enforced (we write all disknodes before asking
>> the kernel to write back dirty pages, and then poke the writes).
>>
>> Even then, pager_sync is called with wait = 0. It is handed to the pager,
>> which
>> sends it to Mach. At some later time (seconds or so later), Mach sends it
>> back
>> to the ext2 pager, which finally issues store_write to write it to Layer 3
>> (The
>> Hard drive). And even that depends on how the driver reorders or delays it.
>>
>> The effect of this architecture is that when store_write is finally
>> called, the
>> absolute latest version of the node cache snapshot is what gets written to
>> the
>> storage. Is this basically correct?
>>
>>
>> It seems to be so indeed.
>>
>> Are there any edge cases or mechanics that are wrong in this model
>> that would make us receive a "stale" node cache snapshot?
>>
>>
>> Well, it can be "stale" if another RPC hasn't called
>> diskfs_node_update() yet, but that's what "safe" FS are all about: not
>> actually providing more than coherency of the content on the disk, so
>> fsck is not supposed to be needed. Then, if a program really wants
>> coherency between some files etc., it has to issue sync calls; dpkg does
>> this, for instance.
>>
>> Samuel
>>
>>
>>