Hugo Mills wrote on 2015/11/24 22:33 +0000:
On Tue, Nov 24, 2015 at 04:26:47PM -0600, Eric Sandeen wrote:
On 11/24/15 2:38 PM, Austin S Hemmelgarn wrote:
if the system was
shut down cleanly, you're fine barring software bugs, but if it
crashed, you should be running a check on the FS.
Um, no...
The *entire point* of having a journaling filesystem is that after a
crash or power loss, a journal replay on next mount will bring the
metadata into a consistent state.
Not an actual argument within the discussion, but an interesting
observation on a fine distinction:
It's interesting to note that there's a difference here between
journalling and CoW filesystems. A journalling FS needs a journal
replay to become consistent. A CoW FS is _always_ consistent, by
design. Now, btrfs has a log that should be replayed after an unclean
shutdown, but that's all about the data that got written within the
current transaction that wasn't committed,
In fact, the btrfs log tree is only used to speed up fsync. There is a
"notreelog" mount option that disables the log tree; if one uses it,
fsync performance just drops to the level of sync.
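For reference, here is what the two calls being compared do. fsync(2) asks the kernel to make one file's data and metadata durable; sync(2) flushes everything. Below is a minimal C sketch of the fsync side. The function name write_and_fsync is hypothetical, for illustration only. Going by the description above, on btrfs the log tree is what keeps the fsync() call here from costing a full transaction commit. With notreelog it would cost about as much as sync.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to path, then make them
 * durable with fsync(2). Returns 0 on success, -1 on error. */
int write_and_fsync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {               /* write(2) may be short; loop until done */
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* Persist only this file's data and metadata (unlike sync(2),
     * which flushes all filesystems). */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Timing this against a version that calls sync() instead of fsync(fd), on a btrfs mount with and without notreelog, would be one way to observe the difference described above.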
So it's just an optimization. Although this is already quite far from
the original topic, I think the best way for btrfs to improve fsync
performance is to introduce something like what ext* has:
a per-file extent map tree.
The reason btrfs is slow on fsync is that file extents and inode info are
all stored in the same tree (the fs tree or subvolume tree).
To fsync a single inode, it's impossible to write out only its file
extents; the whole tree has to be synced, which may be just as slow as a
full sync.
That's why the log tree was introduced: it writes back only the file
extents of one inode and records its metadata changes in the log tree.
Performance test results also support this.
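Here is a toy cost model of that argument. It is hypothetical and not btrfs code. It counts how many dirty items must reach disk to fsync one inode in two cases. In the first, fsync means committing the shared tree. In the second, only that inode's changes go into the log tree. It ignores tree nodes, superblock writes and everything else a real commit does, so it only illustrates the counting, not actual I/O.

```c
#include <stddef.h>

/* Toy model (hypothetical): each dirty item in the shared fs/subvolume
 * tree is tagged with the inode number it belongs to. */
struct dirty_item {
    unsigned long ino;
};

/* Without a log tree, fsync of any inode commits the whole shared tree,
 * so every dirty item is written, just like a full sync. */
size_t items_written_commit(const struct dirty_item *items, size_t n,
                            unsigned long target_ino)
{
    (void)items;       /* the target does not matter: everything goes out */
    (void)target_ino;
    return n;
}

/* With the log tree, only the target inode's extents and metadata
 * changes are written out. */
size_t items_written_log_tree(const struct dirty_item *items, size_t n,
                              unsigned long target_ino)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (items[i].ino == target_ino)
            count++;
    return count;
}
```

With four dirty items belonging to inodes 257, 258, 257 and 259, fsyncing inode 257 writes all 4 items in the commit case but only 2 in the log-tree case.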
But other filesystems, at least ext*, use a better solution: each inode
(regular file or directory alike) has its own tree recording its file
extents or directory entries, which makes fsync quite easy and
straightforward.
If btrfs followed the same design, random RW performance might at least
get a boost, and the fsync code would be simpler.
Thanks,
Qu
rather than about FS
metadata consistency. This means that a read-only mount of btrfs can
_actually_ be read-only, not modifying any of the data on the disk,
whereas a read-only mount of a journalling FS _must_ modify the disk
data after an unclean shutdown, in order to be useful (because the FS
isn't consistent without the journal replay).
Hugo.