On Mon, Oct 29, 2018 at 09:42:06AM +0000, fdman...@kernel.org wrote: > From: Filipe Manana <fdman...@suse.com> > > Recently we got a massive simplification for fsync, where for the fast > path we no longer log new extents while their respective ordered extents > are still running. > > However that simplification introduced a subtle regression for the case > where we use a ranged fsync (msync). Consider the following example: > > CPU 0 CPU 1 > > mmap write to range [2Mb, 4Mb[ > mmap write to range [512Kb, 1Mb[ > msync range [512K, 1Mb[ > --> triggers fast fsync > (BTRFS_INODE_NEEDS_FULL_SYNC > not set) > --> creates extent map A for this > range and adds it to list of > modified extents > --> starts ordered extent A for > this range > --> waits for it to complete > > writeback triggered for range > [2Mb, 4Mb[ > --> create extent map B and > adds it to the list of > modified extents > --> creates ordered extent B > > --> start looking for and logging > modified extents > --> logs extent maps A and B > --> finds checksums for extent A > in the csum tree, but not for > extent B > fsync (msync) finishes > > --> ordered extent B > finishes and its > checksums are added > to the csum tree > > <power cut> > > After replaying the log, we have the extent covering the range [2Mb, 4Mb[ > but do not have the data checksum items covering that file range. > > This happens because at the very beginning of an fsync (btrfs_sync_file()) > we start and wait for IO in the given range [512Kb, 1Mb[ and therefore > wait for any ordered extents in that range to complete before we start > logging the extents. However if right before we start logging the extent > in our range [512Kb, 1Mb[, writeback is started for any other dirty range, > such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync > or msync (btrfs_sync_file() starts writeback before acquiring the inode's > lock), an ordered extent is created for that other range and a new extent > map is created to represent that range and added to the inode's list of > modified extents. > > That means that we will see that other extent in that list when collecting > extents for logging (done at btrfs_log_changed_extents()) and log the > extent before the respective ordered extent finishes - namely before the > checksum items are added to the checksums tree, which is where > log_extent_csums() looks for the checksums, therefore making us log an > extent without logging its checksums. Before that massive simplification > of fsync, this wasn't a problem because besides looking for checkums in > the checksums tree, we also looked for them in any ordered extent still > running. > > The consequence of data checksums missing for a file range is that users > attempting to read the affected file range will get -EIO errors and dmesg > reports the following: > > [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start > 57344 > [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off > 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1 > > So fix this by skipping an extents outside of our logging range at > btrfs_log_changed_extents() and leaving them on the list of modified > extents so that any subsequent ranged fsync may collect them if needed. > Also, if we find a hole extent outside of the range still log it, just > to prevent having gaps between extent items after replaying the log, > otherwise fsck will complain when we are not using the NO_HOLES feature > (fstest btrfs/056 triggers such case). > > Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the > log_one_extent path") > CC: sta...@vger.kernel.org # 4.19+ > Signed-off-by: Filipe Manana <fdman...@suse.com>
Nice catch, Reviewed-by: Josef Bacik <jo...@toxicpanda.com> Josef