Hugo Mills wrote on 2015/11/24 22:33 +0000:
On Tue, Nov 24, 2015 at 04:26:47PM -0600, Eric Sandeen wrote:
On 11/24/15 2:38 PM, Austin S Hemmelgarn wrote:

if the system was
shut down cleanly, you're fine barring software bugs, but if it
crashed, you should be running a check on the FS.

Um, no...

The *entire point* of having a journaling filesystem is that after a
crash or power loss, a journal replay on next mount will bring the
metadata into a consistent state.

    Not an actual argument within the discussion, but an interesting
observation on a fine distinction:

    It's interesting to note that there's a difference here between
journalling and CoW filesystems. A journalling FS needs a journal
replay to become consistent. A CoW FS is _always_ consistent, by
design. Now, btrfs has a log that should be replayed after an unclean
shutdown, but that's all about the data that got written within the
current transaction that wasn't committed,

In fact, the btrfs log tree is only used to speed up fsync. There is a "notreelog" mount option that disables the log tree; with it, fsync performance simply drops to the level of sync.

So it's just an optimization. Although this is already quite far from the original topic, I think the best way for btrfs to improve fsync performance is to introduce something like what ext* has:
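
To make the fsync versus sync distinction concrete, here is a minimal sketch that times both calls from userspace. It only uses the standard os.fsync() and os.sync() calls; the helper names timed_fsync and timed_sync are made up for illustration, and the numbers you get depend entirely on the filesystem, mount options and whatever else is dirty on the machine.

```python
import os
import tempfile
import time


def timed_fsync(path, payload):
    """Write payload to path and time only the fsync() call (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                      # push Python's userspace buffer to the kernel
        start = time.perf_counter()
        os.fsync(f.fileno())           # ask the kernel to persist this one file
        return time.perf_counter() - start


def timed_sync():
    """Time a system-wide sync(), which flushes every dirty file (hypothetical helper)."""
    start = time.perf_counter()
    os.sync()
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        probe = os.path.join(d, "probe.bin")
        print("fsync:", timed_fsync(probe, b"x" * 4096))
        print("sync: ", timed_sync())
```

Running it once with the default mount options and once with notreelog on the same btrfs volume is one way to see the effect described above, assuming there is other dirty data around for sync to flush.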

Per-file extent map tree.


The reason btrfs is slow on fsync is that file extents and inode info are all stored in the same tree (the fs tree or a subvolume tree).

To fsync a single inode, it's impossible to sync only that inode's file extents; the whole tree has to be synced, which may be just as slow as a full sync.

That's why the log tree was introduced: it only writes back the file extents of one inode and records that inode's metadata changes in the log tree.
Performance test results also support this.
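
The argument above can be sketched as a toy model. This is not btrfs code: the class SharedTreeFS, its methods and the simplified (inode, kind, offset) key are all assumptions made for illustration. It only counts how many items each fsync strategy would have to write.

```python
from collections import namedtuple

# Hypothetical, simplified key: (inode number, item kind, offset).
Key = namedtuple("Key", "ino kind offset")


class SharedTreeFS:
    """Toy model: every inode keeps its items in one shared tree."""

    def __init__(self):
        self.dirty = {}      # modified items not yet committed to the shared tree
        self.committed = {}  # items made durable by a full tree commit
        self.log = {}        # items made durable through the small log tree

    def write(self, ino, offset, data):
        # A write dirties the inode item and one extent item.
        self.dirty[Key(ino, "inode", 0)] = "metadata"
        self.dirty[Key(ino, "extent", offset)] = data

    def fsync_with_log(self, ino):
        """Copy only this inode's dirty items into the log; return the count."""
        mine = {k: v for k, v in self.dirty.items() if k.ino == ino}
        self.log.update(mine)
        # The items stay dirty here: in this model a later full commit
        # still writes them into the shared tree.
        return len(mine)

    def fsync_without_log(self, ino):
        """Commit the whole shared tree; return the number of items written."""
        written = len(self.dirty)
        self.committed.update(self.dirty)
        self.dirty.clear()
        return written


fs = SharedTreeFS()
for ino in range(1, 101):
    fs.write(ino, 0, b"x")

print(fs.fsync_with_log(1))     # 2: only inode 1's items
print(fs.fsync_without_log(1))  # 200: every dirty item in the tree
```

With 100 dirty inodes, the log path touches 2 items while the full commit touches 200, which is the gap the log tree is meant to close.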


But other filesystems, at least ext*, use a better solution: each inode (whether a regular file or a directory) has its own tree to record its file extents or dir entries,
making fsync quite easy and straightforward.

If btrfs followed the same design, it might at least boost random RW performance and simplify the fsync code.
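
For comparison, a per-inode layout can be sketched the same way. Again this is only a toy model under my own assumptions (the class PerInodeTreeFS and its methods are hypothetical, and it is not how ext* or btrfs is actually implemented); it just shows why fsync becomes simple when each inode owns its extent map.

```python
class PerInodeTreeFS:
    """Toy model: every inode owns a private map of its extents."""

    def __init__(self):
        self.dirty = {}    # ino -> {offset: data} not yet on disk
        self.durable = {}  # ino -> {offset: data} already persisted

    def write(self, ino, offset, data):
        self.dirty.setdefault(ino, {})[offset] = data

    def fsync(self, ino):
        """Persist only this inode's map; return the number of extents written."""
        pending = self.dirty.pop(ino, {})
        self.durable.setdefault(ino, {}).update(pending)
        return len(pending)


fs = PerInodeTreeFS()
for ino in range(1, 101):
    fs.write(ino, 0, b"x")

print(fs.fsync(1))     # 1: only inode 1's single extent
print(len(fs.dirty))   # 99: the other inodes are untouched
```

No separate log is needed in this model, because the unit that fsync has to flush is already scoped to one inode.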

Thanks,
Qu


rather than about FS
metadata consistency. This means that a read-only mount of btrfs can
_actually_ be read-only, not modifying any of the data on the disk,
whereas a read-only mount of a journalling FS _must_ modify the disk
data after an unclean shutdown, in order to be useful (because the FS
isn't consistent without the journal replay).

    Hugo.


