Re: [PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)

2018-11-01 Thread David Sterba
On Mon, Oct 29, 2018 at 09:42:06AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Recently we got a massive simplification for fsync, where for the fast
> path we no longer log new extents while their respective ordered extents
> are still running.
> 
> However that simplification introduced a subtle regression for the case
> where we use a ranged fsync (msync). Consider the following example:
> 
>CPU 0CPU 1
> 
> mmap write to range [2Mb, 4Mb[
>   mmap write to range [512Kb, 1Mb[
>   msync range [512K, 1Mb[
> --> triggers fast fsync
> (BTRFS_INODE_NEEDS_FULL_SYNC
>  not set)
> --> creates extent map A for this
> range and adds it to list of
> modified extents
> --> starts ordered extent A for
> this range
> --> waits for it to complete
> 
> writeback triggered for range
> [2Mb, 4Mb[
>   --> create extent map B and
>   adds it to the list of
>   modified extents
>   --> creates ordered extent B
> 
> --> start looking for and logging
> modified extents
> --> logs extent maps A and B
> --> finds checksums for extent A
> in the csum tree, but not for
> extent B
>   fsync (msync) finishes
> 
>   --> ordered extent B
>   finishes and its
>   checksums are added
>   to the csum tree
> 
> 
> 
> After replaying the log, we have the extent covering the range [2Mb, 4Mb[
> but do not have the data checksum items covering that file range.
> 
> This happens because at the very beginning of an fsync (btrfs_sync_file())
> we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
> wait for any ordered extents in that range to complete before we start
> logging the extents. However if right before we start logging the extent
> in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
> such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
> or msync (btrfs_sync_file() starts writeback before acquiring the inode's
> lock), an ordered extent is created for that other range and a new extent
> map is created to represent that range and added to the inode's list of
> modified extents.
> 
> That means that we will see that other extent in that list when collecting
> extents for logging (done at btrfs_log_changed_extents()) and log the
> extent before the respective ordered extent finishes - namely before the
> checksum items are added to the checksums tree, which is where
> log_extent_csums() looks for the checksums, therefore making us log an
> extent without logging its checksums. Before that massive simplification
> of fsync, this wasn't a problem because besides looking for checkums in
> the checksums tree, we also looked for them in any ordered extent still
> running.
> 
> The consequence of data checksums missing for a file range is that users
> attempting to read the affected file range will get -EIO errors and dmesg
> reports the following:
> 
>  [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 
> 57344
>  [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 
> 57344 csum 0x98f94189 expected csum 0x mirror 1
> 
> So fix this by skipping an extents outside of our logging range at
> btrfs_log_changed_extents() and leaving them on the list of modified
> extents so that any subsequent ranged fsync may collect them if needed.
> Also, if we find a hole extent outside of the range still log it, just
> to prevent having gaps between extent items after replaying the log,
> otherwise fsck will complain when we are not using the NO_HOLES feature
> (fstest btrfs/056 triggers such case).
> 
> Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
> log_one_extent path")
> CC: sta...@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana 

Thanks, added to misc-next and on the way to 4.20.


Re: [PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)

2018-10-31 Thread Josef Bacik
On Mon, Oct 29, 2018 at 09:42:06AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Recently we got a massive simplification for fsync, where for the fast
> path we no longer log new extents while their respective ordered extents
> are still running.
> 
> However that simplification introduced a subtle regression for the case
> where we use a ranged fsync (msync). Consider the following example:
> 
>CPU 0CPU 1
> 
> mmap write to range [2Mb, 4Mb[
>   mmap write to range [512Kb, 1Mb[
>   msync range [512K, 1Mb[
> --> triggers fast fsync
> (BTRFS_INODE_NEEDS_FULL_SYNC
>  not set)
> --> creates extent map A for this
> range and adds it to list of
> modified extents
> --> starts ordered extent A for
> this range
> --> waits for it to complete
> 
> writeback triggered for range
> [2Mb, 4Mb[
>   --> create extent map B and
>   adds it to the list of
>   modified extents
>   --> creates ordered extent B
> 
> --> start looking for and logging
> modified extents
> --> logs extent maps A and B
> --> finds checksums for extent A
> in the csum tree, but not for
> extent B
>   fsync (msync) finishes
> 
>   --> ordered extent B
>   finishes and its
>   checksums are added
>   to the csum tree
> 
> 
> 
> After replaying the log, we have the extent covering the range [2Mb, 4Mb[
> but do not have the data checksum items covering that file range.
> 
> This happens because at the very beginning of an fsync (btrfs_sync_file())
> we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
> wait for any ordered extents in that range to complete before we start
> logging the extents. However if right before we start logging the extent
> in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
> such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
> or msync (btrfs_sync_file() starts writeback before acquiring the inode's
> lock), an ordered extent is created for that other range and a new extent
> map is created to represent that range and added to the inode's list of
> modified extents.
> 
> That means that we will see that other extent in that list when collecting
> extents for logging (done at btrfs_log_changed_extents()) and log the
> extent before the respective ordered extent finishes - namely before the
> checksum items are added to the checksums tree, which is where
> log_extent_csums() looks for the checksums, therefore making us log an
> extent without logging its checksums. Before that massive simplification
> of fsync, this wasn't a problem because besides looking for checkums in
> the checksums tree, we also looked for them in any ordered extent still
> running.
> 
> The consequence of data checksums missing for a file range is that users
> attempting to read the affected file range will get -EIO errors and dmesg
> reports the following:
> 
>  [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 
> 57344
>  [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 
> 57344 csum 0x98f94189 expected csum 0x mirror 1
> 
> So fix this by skipping an extents outside of our logging range at
> btrfs_log_changed_extents() and leaving them on the list of modified
> extents so that any subsequent ranged fsync may collect them if needed.
> Also, if we find a hole extent outside of the range still log it, just
> to prevent having gaps between extent items after replaying the log,
> otherwise fsck will complain when we are not using the NO_HOLES feature
> (fstest btrfs/056 triggers such case).
> 
> Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
> log_one_extent path")
> CC: sta...@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana 

Nice catch,

Reviewed-by: Josef Bacik 

Josef


[PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)

2018-10-29 Thread fdmanana
From: Filipe Manana 

Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.

However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:

   CPU 0CPU 1

mmap write to range [2Mb, 4Mb[
  mmap write to range [512Kb, 1Mb[
  msync range [512K, 1Mb[
--> triggers fast fsync
(BTRFS_INODE_NEEDS_FULL_SYNC
 not set)
--> creates extent map A for this
range and adds it to list of
modified extents
--> starts ordered extent A for
this range
--> waits for it to complete

writeback triggered for range
[2Mb, 4Mb[
  --> create extent map B and
  adds it to the list of
  modified extents
  --> creates ordered extent B

--> start looking for and logging
modified extents
--> logs extent maps A and B
--> finds checksums for extent A
in the csum tree, but not for
extent B
  fsync (msync) finishes

  --> ordered extent B
  finishes and its
  checksums are added
  to the csum tree



After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.

This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.

That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checkums in
the checksums tree, we also looked for them in any ordered extent still
running.

The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:

 [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
 [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 
57344 csum 0x98f94189 expected csum 0x mirror 1

So fix this by skipping an extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).

Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
log_one_extent path")
CC: sta...@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c86c5dd100b2..d49edd25f2e5 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4394,6 +4394,23 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
test_gen = root->fs_info->last_trans_committed;
 
list_for_each_entry_safe(em, n, >modified_extents, list) {
+   /*
+* Skip extents outside our logging range. It's important to do
+* it for correctness because if we don't ignore them, we may
+* log them before their ordered extent completes, and therefore
+* we could log them without logging their respective checksums
+* (the checksum items are added to the csum tree at the