Hi Samuel,

I am writing to sincerely apologize for the insane amount of patch churn
over the last week. I know the rapid version bumps from v2 up to v4 have
been incredibly noisy, and I want to hit the brakes before you spend any
more time reviewing the current code.

While running some extreme stress tests on a very small ext2 partition with
the tiniest journal the tooling allows, I caught a few critical edge
cases. While fixing those, I also realized that my libdiskfs VFS API
boundary is clunkier than it needs to be. I am currently rewriting it to
more closely match Linux's JBD2 semantics, where the VFS simply flags a
transaction for sync and calls stop, allowing the journal to auto-commit
when the reference count drops to zero.

I'm also adding handling for cases where the Mach VM pager rushes blocks to
the disk while they are still committing. The new handling safely
intercepts them and should eliminate the WAL-violation warnings in almost
all cases.

Please completely disregard v4.

I promise the churn is coming to an end. I am going to take a little time
to finish this API contraction, stress-test it, polish it, and make sure it
is 100% rock-solid. I will be back soon with a finalized v5.

Thanks for your patience with my crazy iteration process!

Best, Milos


On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]> wrote:

> Hi Samuel,
>
> As promised, here is the thoroughly tested and benchmarked V4 revision of
> the JBD2 Journal for Hurd.
>
> This revision addresses a major performance bottleneck present in V3 under
> heavy concurrent workloads. The new design restores performance to match
> vanilla Hurd's unjournaled ext2fs while preserving full crash consistency.
>
> Changes since V3:
> - Removed eager memcpy() from the journal_dirty_block() hot-path.
> - Introduced deferred block copying that triggers only when the
> transaction becomes quiescent.
> - Added a `needs_copy` flag to prevent redundant memory copies.
> - Eliminated the severe lock contention and memory bandwidth pressure
> observed in V3.
>
> Why the changes in v4 vs v3?
> I had previously identified the last remaining performance bottleneck:
> a 4 KB memcpy every time journal_dirty_block() is called, and I was
> thinking about how to improve it. A deferred copy comes to mind, but...
> The Hurd VFS locks at the node level rather than the physical block level
> (as Linux does). Because multiple nodes may share the same 4KB disk block,
> naively deferring the journal copy until commit time can capture torn
> writes if another thread is actively modifying a neighboring node in the
> same block.
>
> Precisely because of this V3 performed a 4KB memcpy immediately inside
> journal_dirty_block() (copy on write) while the node lock was held. While
> safe, this placed expensive memory operations and global journal lock
> contention directly in the VFS hot-path, causing severe slowdowns under
> heavy parallel workloads.
>
> V4 removes this eager copy entirely by leveraging an existing transaction
> invariant:
> All VFS threads increment and decrement the active transaction’s
> `t_updates` counter via the start/stop transaction functions. A transaction
> cannot commit until this counter reaches zero.
> When `t_updates == 0`, we are *mathematically guaranteed that no VFS
> threads are mutating blocks belonging to the transaction.* At that exact
> moment, the memory backing those blocks has fully settled and can be safely
> copied without risk of torn writes. A perfect place for a deferred write!
>
> journal_dirty_block() now simply records the dirty block id in a hash
> table, making the hot-path strictly O(1) (and this is why we see such a
> large performance boost between v3 and v4).
>
> But we also need to avoid redundant copies:
> Because transactions remain open for several seconds, `t_updates` may
> bounce to zero and back up many times during a heavy workload (as multiple
> VFS threads start/stop the transaction). To avoid repeatedly copying the
> same unchanged blocks every time the counter hits zero, each shadow buffer
> now contains a `needs_copy` flag.
>
> When a block is dirtied, the flag is set. When `t_updates` reaches zero,
> only buffers with `needs_copy == 1` are copied to the shadow buffers, after
> which the flag is cleared.
> So two things must be true for a block to be copied: 1) t_updates has
> just hit 0, and 2) needs_copy is 1.
>
> This architecture completely removes the hot-path bottleneck. Journaled
> ext2fs now achieves performance virtually identical to vanilla ext2fs, even
> under brutal concurrency (e.g. scripts doing heavy writes from multiple
> shells at the same time).
>
> I know this is a dense patch with a lot to unpack. I've documented the
> locking and Mach VM interactions as thoroughly as possible in the code
> itself (roughly 1/3 of the lines are comments in ext2fs/journal.c), but I
> understand there is only so much nuance that can fit into C comments.
> If it would be helpful, I would be happy to draft a dedicated document
> detailing the journal's lifecycle, its hooks into libdiskfs/ext2, and the
> rationale behind the macro-transaction design, so future developers have a
> clear reference.
>
> Looking forward to your thoughts.
>
> Best,
> Milos
>
>
>
>
>
> On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]> wrote:
>
>> Hi Samuel,
>>
>> Just a quick heads-up: please hold off on reviewing this V3 series.
>>
>> While V3 works fast in simple single-threaded scenarios (like configure
>> or make ext2fs), I found that under some heavy *multi-threaded stress
>> tests* V3 suffers a significant performance degradation, due to a lock
>> contention bottleneck caused by the eager memcpy in the VFS hot path
>> (the memcpy inside journal_dirty_block, which is called thousands of
>> times a second, really becomes a performance problem).
>>
>>  I have been working on a much cleaner approach that safely defers the
>> block copying to the quiescent transaction stop state. It completely
>> eliminates the VFS lock contention and brings the journaled performance
>> back to vanilla ext2fs levels even with many threads competing at
>> writing/reading/renaming in the same place.
>>
>> I am going to test this new architecture thoroughly over the next few
>> days and will send it as V4 once I am certain it is rock solid.
>>
>> Thanks!
>>
>>
>> On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]>
>> wrote:
>>
>>>
>>>
>>>    Hello Samuel and the Hurd team,
>>>
>>>    I am sending over v3 of the journaling patch. I know v2 is still
>>>    pending review, but while testing and profiling based on previous 
>>> feedback,
>>>    I realized the standard mapping wasn't scaling well for metadata-heavy
>>>    workloads. I wanted to send this updated architecture your way to save 
>>> you
>>>    from spending time reviewing the obsolete v2 code.
>>>
>>>    This version keeps the core JBD2 logic from v2 but introduces
>>>    several structural optimizations, bug fixes, and code cleanups:
>>>        - Robin Hood Hash Map: Replaced ihash with a custom map for
>>>    significantly tighter cache locality and faster lookups.
>>>        - O(1) Slab Allocator: Added a pre-allocated pool to make
>>>    transaction buffers zero-allocation in the hot path.
>>>        -  Unified Buffer Tracking: Eliminated the dual linked-list/map
>>>    structure in favor of just the map, fixing a synchronization bug from v2
>>>    and simplifying the code.
>>>        - A few other small bug fixes
>>>        - Refactored Dirty Block Hooks: Moved the journal_dirty_block
>>>    calls from inode.c directly into the ext2fs.h low-level block computation
>>>    functions (record_global_poke, sync_global_ptr, record_indir_poke, and
>>>    alloc_sync). This feels like a more natural fit and makes it much easier 
>>> to
>>>    ensure we aren't missing any call sites.
>>>
>>>    Performance Benchmarks:
>>>    I ran repeated tests on my machine to measure the overhead,
>>>    comparing this v3 journal implementation against Vanilla Hurd.
>>>    make ext2fs (CPU/Data bound - 5 runs):
>>>        Vanilla Hurd Average: ~2m 40.6s
>>>        Journal v3 Average: ~2m 41.3s
>>>        Result: Statistical tie. Journal overhead is practically zero.
>>>
>>>    make clean && ../configure (Metadata bound - 5 runs):
>>>        Vanilla Hurd Average: ~3.90s (with latency spikes up to 4.29s)
>>>        Journal v3 Average: ~3.72s (rock-solid consistency, never
>>>    breaking 3.9s)
>>>        Result: *Journaled ext2 is actually faster and more predictable
>>>    here* due to the WAL absorbing random I/O.
>>>
>>>    Crash Consistency Proof:
>>>    Beyond performance, I wanted to demonstrate the actual crash
>>>    recovery in action.
>>>        Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
>>>        Wait for the 5-second kjournald commit tick.
>>>        Hard crash the machine (kill -9 the QEMU process on the host).
>>>
>>>    Inspecting from the Linux host before recovery shows the inode is
>>>    completely busted (as expected):
>>>    > sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>>>    debugfs 1.47.3 (8-Jul-2025)
>>>    Inode: 373911   Type: bad type    Mode:  0000   Flags: 0x0
>>>    Generation: 0    Version: 0x00000000
>>>    User:     0   Group:     0   Size: 0
>>>    File ACL: 0 Translator: 0
>>>    Links: 0   Blockcount: 0
>>>    Fragment:  Address: 0    Number: 0    Size: 0
>>>    ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>>>    atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>>>    mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
>>>    BLOCKS:
>>>
>>>    Note: on *vanilla Hurd*, running fsck here would permanently lose
>>>    the directory, or potentially cause further damage, depending on luck.
>>>
>>>    Triggering the journal replay:
>>>    sudo e2fsck -fy /dev/nbd0
>>>
>>>    Inspecting immediately after recovery:
>>>    > sudo debugfs -R "stat /home/loshmi/test-dir3" /dev/nbd0
>>>    debugfs 1.47.3 (8-Jul-2025)
>>>    Inode: 373911   Type: directory    Mode:  0775   Flags: 0x0
>>>    Generation: 1773077012    Version: 0x00000000
>>>    User:  1001   Group:  1001   Size: 4096
>>>    File ACL: 0 Translator: 0
>>>    Links: 2   Blockcount: 8
>>>    Fragment:  Address: 0    Number: 0    Size: 0
>>>    ctime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
>>>    atime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
>>>    mtime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
>>>    BLOCKS:
>>>    (0):1507738
>>>    TOTAL: 1
>>>
>>>    The *journal successfully reconstructed the directory*, and logdump
>>>    confirms the transactions were consumed perfectly.
>>>
>>>    I have run similar hard-crash tests for rename, chmod, chown, etc.,
>>>    with the same successful recovery results.
>>>
>>>    I've attached the v3 diff. Let me know what you think of the new
>>>    hash map and slab allocator approach!
>>>
>>>    Best,
>>>    Milos
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]>
>>> wrote:
>>>
>>>> And here is the last one...
>>>>
>>>> I hacked up an improvement for journal_dirty_block to see if I could
>>>> speed it up a bit:
>>>> 1) Used a specialized Robin Hood hash table for speed (no tombstones,
>>>> etc.). I took it from one of my personal projects and just specialized
>>>> it here a bit.
>>>> 2) Used a small slab allocator to avoid malloc-ing in the hot path.
>>>> 3) Liberally sprinkled __rdtsc() calls to get a sense of cycle counts
>>>> inside journal_dirty_block.
>>>>
>>>> I have to say, just this simple local change managed to shave off 3-5%
>>>> of the slowdown.
>>>>
>>>> So my test is:
>>>> - Boot Hurd
>>>> - Inside Hurd go to the Hurd build directory
>>>> - run:
>>>> $ make clean && ../configure
>>>> $ time make ext2fs
>>>>
>>>> I do it multiple times for 3 different versions of ext2 libraries
>>>> 1) Vanilla Hurd (no journal): avg ~151 seconds
>>>>
>>>> 2) Enhanced JBD2 (slab + custom hash): ~159 seconds (5% slower!)
>>>>
>>>> 3) Baseline JBD2 (malloc + libihash, as sent in V2): ~168 seconds
>>>>
>>>> Of course there is a lot of variability, and my laptop is not a perfect
>>>> environment for these kinds of benchmarks, but this is what I have.
>>>>
>>>> My printouts on the screen show this:
>>>> ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
>>>> ext2fs: part:5:device:wd0: warning: Total Dirty Calls:      339105
>>>> ext2fs: part:5:device:wd0: warning: Total Function:         217101909
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: Total Lock Wait:        16741691
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: Total Alloc:            673363
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: Total Memcpy:           137938008
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: Total Hash Add:         258533
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: Total Hash Find:        29501960
>>>> cycles
>>>> ext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call)
>>>> ---
>>>> ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
>>>> ext2fs: part:5:device:wd0: warning: Avg Lock Wait:     49 cycles
>>>> ext2fs: part:5:device:wd0: warning: Avg Memcpy:        406 cycles
>>>> ext2fs: part:5:device:wd0: warning: Avg Malloc 1:      1 cycles
>>>> ext2fs: part:5:device:wd0: warning: Avg Hash Add:      0 cycles
>>>> ext2fs: part:5:device:wd0: warning: Avg Hash Find:      86 cycles
>>>> ext2fs: part:5:device:wd0: warning: ==================
>>>>
>>>> The averages say a lot: with these improvements we are now down to
>>>> basically memcpy time, and for copying 4096 bytes of RAM I'm not sure
>>>> we can make it take less than ~400 cycles, so we are hitting hardware
>>>> limitations.
>>>> It would be great if we could avoid the memcpy here altogether, or
>>>> delay it until commit or similar. I have some ideas, but they all
>>>> require drastic changes across libdiskfs and ext2fs, and I am not sure
>>>> a few remaining percentage points of slowdown warrant that.
>>>>
>>>> Also, wow: during the ext2 compilation this function
>>>> (journal_dirty_block) is called a bit more than 1000 times per second,
>>>> for each and every block ever touched by the compiler.
>>>>
>>>> I am attaching the altered journal.c in case anyone is interested in
>>>> seeing the localized changes.
>>>>
>>>> Regards,
>>>> Milos
>>>>
>>>> On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Samuel,
>>>>>
>>>>> One quick detail I forgot to mention regarding the performance
>>>>> analysis:
>>>>>
>>>>> The entire ~0.4s performance impact I measured is isolated exclusively
>>>>> to journal_dirty_block.
>>>>>
>>>>> To verify this, I ran an experiment where I stubbed out
>>>>> journal_dirty_block so it just returned immediately (which obviously
>>>>> makes for a very fast, but not very useful, journal!). With that single
>>>>> function bypassed, the filesystem performs identically to vanilla Hurd.
>>>>>
>>>>> This confirms that the background kjournald flusher, the transaction
>>>>> reference counting, and the checkpointing logic add absolutely no
>>>>> noticeable latency to the VFS. The overhead is strictly tied to the
>>>>> physics of the memory copying and hash-map lookups in that one
>>>>> function, which we can improve in subsequent patches.
>>>>>
>>>>> Thanks, Milos
>>>>>
>>>>> On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Samuel,
>>>>>>
>>>>>> Thanks for reviewing my mental model on V1; I appreciate the detailed
>>>>>> feedback.
>>>>>>
>>>>>> Attached is the v2 patch. Here is a breakdown of the architectural
>>>>>> changes and refactors based on your review:
>>>>>>
>>>>>> 1. diskfs_node_update and the Pager
>>>>>> Regarding the question, "Do we really want to update the node?": Yes,
>>>>>> we must update it with every change. JBD2 works strictly at the physical
>>>>>> block level, not the abstract node cache level. To capture a node change 
>>>>>> in
>>>>>> the journal, the block content must be physically serialized to the
>>>>>> transaction buffer. Currently, this path is diskfs_node_update ->
>>>>>> diskfs_write_disknode -> journal_dirty_block.
>>>>>> When wait is 0, this just copies the node details from the node-cache
>>>>>> to the pager. It is strictly an in-memory serialization and is extremely
>>>>>> fast. I have updated the documentation for diskfs_node_update to 
>>>>>> explicitly
>>>>>> describe this behavior so future maintainers understand it isn't 
>>>>>> triggering
>>>>>> synchronous disk I/O and doesn't measurably increase the latency of the
>>>>>> file system.
>>>>>> journal_dirty_block is now one of the most hammered functions in
>>>>>> libdiskfs/ext2; more on that below.
>>>>>>
>>>>>> 2. Synchronous Wait & Factorization
>>>>>> I completely agree with your factorization advice:
>>>>>> write_disknode_journaled has been folded directly into
>>>>>> diskfs_write_disknode, making it much cleaner.
>>>>>> Regarding the wait flag: we are no longer ignoring it! Instead of
>>>>>> blocking the VFS deeply in the stack, we now set an "IOU" flag on the
>>>>>> transaction. This bubbles the sync requirement up to the outer RPC layer,
>>>>>> which is the only place safe enough to actually sleep on the commit and
>>>>>> thus maintain the POSIX sync requirement without deadlocking etc.
>>>>>>
>>>>>> 3. Multiple Writes to the Same Metadata Block
>>>>>> "Can it happen that we write several times to the same metadata
>>>>>> block?" Yes, multiple nodes can live in the same block. However, because
>>>>>> the Mach pager always flushes the "latest snapshot" of the block, we 
>>>>>> don't
>>>>>> have an issue with mixed or stale data hitting the disk.
>>>>>> If RPCs hit while the pager is actively writing, that is all captured
>>>>>> in the RUNNING TRANSACTION. If that RUNNING TRANSACTION happens to
>>>>>> contain the same blocks the pager is committing, the RUNNING
>>>>>> TRANSACTION will be forcibly committed.
>>>>>>
>>>>>> 4. The New libdiskfs API
>>>>>> I added two new opaque accessors to diskfs.h:
>>>>>>
>>>>>>     diskfs_journal_set_sync
>>>>>>     diskfs_journal_needs_sync
>>>>>>
>>>>>>     This allows inner nested functions to declare a strict need for a
>>>>>> POSIX sync without causing lock inversions. We only commit at the top RPC
>>>>>> layer once the operation is fully complete and locks are dropped.
>>>>>>
>>>>>> 5. Cleanups & Ordering
>>>>>>     Removed the redundant record_global_poke calls.
>>>>>>     Reordered the pager write notification in journal.c to sit after
>>>>>> the committing function, as the pager write happens after the journal
>>>>>> commit.
>>>>>>     Merged the ext2_journal checks inside
>>>>>> diskfs_journal_start_transaction to return early.
>>>>>>     Reverted the bold unlock moves.
>>>>>>     Fixed the information leaks.
>>>>>>     Elevated the deadlock/WAL bypass logs to ext2_warning.
>>>>>>
>>>>>> Performance:
>>>>>> I investigated the ~0.4s (increase from 4.9s to 5.3s) regression on
>>>>>> my SSD during a heavy Hurd ../configure test. By stubbing out
>>>>>> journal_dirty_block, performance returned to vanilla Hurd speeds, 
>>>>>> isolating
>>>>>> the overhead to that specific function.
>>>>>>
>>>>>> A nanosecond profile reveals the cost is evenly split across the
>>>>>> mandatory physics of a block journal:
>>>>>>
>>>>>>     25%: Lock Contention (Global transaction serialization)
>>>>>>
>>>>>>     22%: Memcpy (Shadowing the 4KB blocks)
>>>>>>
>>>>>>     21%: Hash Find (hurd_ihash lookups for block deduplication)
>>>>>>
>>>>>> I was surprised to see hurd_ihash taking up nearly a quarter of the
>>>>>> overhead. I added some collision mitigation, but left further
>>>>>> improvements out of this patch to keep the scope tight. In the future
>>>>>> we could drop the malloc entirely using a slab allocator and optimize
>>>>>> the hash map to get this overhead closer to zero (along with
>>>>>> introducing a "frozen data" concept like Linux does, but that would be
>>>>>> a bigger, non-localized change).
>>>>>>
>>>>>> Final Note on Lock Hierarchy
>>>>>> The intended, deadlock-free use of the journal in libdiskfs is best
>>>>>> illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h:
>>>>>>   txn = diskfs_journal_start_transaction ();
>>>>>>   pthread_mutex_lock (&np->lock);
>>>>>>   (OPERATION);
>>>>>>   diskfs_node_update (np, diskfs_synchronous);
>>>>>>   pthread_mutex_unlock (&np->lock);
>>>>>>   if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
>>>>>>
>>>>>>     diskfs_journal_commit_transaction (txn);
>>>>>>   else
>>>>>>     diskfs_journal_stop_transaction (txn);
>>>>>>
>>>>>> By keeping journal operations strictly outside of the node
>>>>>> locking/unlocking phases, we treat it as the outermost "lock" on the file
>>>>>> system, mathematically preventing deadlocks.
>>>>>>
>>>>>> Kind regards,
>>>>>> Milos
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Milos Nikic, le jeu. 05 mars 2026 09:31:26 -0800, a écrit:
>>>>>>> > Hurd VFS works in 3 layers:
>>>>>>> >
>>>>>>> >  1. Node cache layer: The abstract node lives here and it is the
>>>>>>> ground truth
>>>>>>> >     of a running file system. When one does a stat myfile.txt, we
>>>>>>> get the
>>>>>>> >     information straight from the cache. When we create a new
>>>>>>> file, it gets
>>>>>>> >     placed in the cache, etc.
>>>>>>> >
>>>>>>> >  2. Pager layer: This is where nodes are serialized into the
>>>>>>> actual physical
>>>>>>> >     representation (4KB blocks) that will later be written to disk.
>>>>>>> >
>>>>>>> >  3. Hard drive: The physical storage that receives the bytes from
>>>>>>> the pager.
>>>>>>> >
>>>>>>> > During normal operations (not a sync mount, fsync, etc.), the VFS
>>>>>>> operates
>>>>>>> > almost entirely on Layer 1: The Node cache layer. This is why it's
>>>>>>> super fast.
>>>>>>> > User changed atime? No problem. It just fetches a node from the
>>>>>>> node cache
>>>>>>> > (hash table lookup, amortized to O(1)) and updates the struct in
>>>>>>> memory. And
>>>>>>> > that is it.
>>>>>>>
>>>>>>> Yes, so that we get as efficient as possible.
>>>>>>>
>>>>>>> > Only when the sync interval hits (every 30 seconds by default)
>>>>>>> does the Node
>>>>>>> > cache get iterated and serialized to the pager layer
>>>>>>> (diskfs_sync_everything ->
>>>>>>> >  write_all_disknodes -> write_node -> pager_sync). So basically,
>>>>>>> at that
>>>>>>> > moment, we create a snapshot of the state of the node cache and
>>>>>>> place it onto
>>>>>>> > the pager(s).
>>>>>>>
>>>>>>> It's not exactly a snapshot because the coherency between inodes and
>>>>>>> data is not completely enforced (we write all disknodes before asking
>>>>>>> the kernel to write back dirty pages, and then poke the writes).
>>>>>>>
>>>>>>> > Even then, pager_sync is called with wait = 0. It is handed to the
>>>>>>> pager, which
>>>>>>> > sends it to Mach. At some later time (seconds or so later), Mach
>>>>>>> sends it back
>>>>>>> > to the ext2 pager, which finally issues store_write to write it to
>>>>>>> Layer 3 (The
>>>>>>> > Hard drive). And even that depends on how the driver reorders or
>>>>>>> delays it.
>>>>>>> >
>>>>>>> > The effect of this architecture is that when store_write is
>>>>>>> finally called, the
>>>>>>> > absolute latest version of the node cache snapshot is what gets
>>>>>>> written to the
>>>>>>> > storage. Is this basically correct?
>>>>>>>
>>>>>>> It seems to be so indeed.
>>>>>>>
>>>>>>> > Are there any edge cases or mechanics that are wrong in this model
>>>>>>> > that would make us receive a "stale" node cache snapshot?
>>>>>>>
>>>>>>> Well, it can be "stale" if another RPC hasn't called
>>>>>>> diskfs_node_update() yet, but that's what "safe" FS are all about:
>>>>>>> not
>>>>>>> actually provide more than coherency of the content on the disk so
>>>>>>> fsck
>>>>>>> is not supposed to be needed. Then, if a program really wants
>>>>>>> coherency
>>>>>>> between some files etc. it has to issue sync calls, dpkg does it for
>>>>>>> instance.
>>>>>>>
>>>>>>> Samuel
>>>>>>>
>>>>>>
