On Tue, Feb 16, 2021 at 11:09:25AM +0000, fdman...@kernel.org wrote: > From: Filipe Manana <fdman...@suse.com> > > When using the NO_HOLES feature, if we clone a file range that spans only > a hole into a range that is at or beyond the current i_size of the > destination file, we end up not setting the full sync runtime flag on the > inode. As a result, if we then fsync the destination file and have a power > failure, after log replay we can end up exposing stale data instead of > having a hole for that range. > > The conditions for this to happen are the following: > > 1) We have a file with a size of, for example, 1280K; > > 2) There is a written (non-prealloc) extent for the file range from 1024K > to 1280K with a length of 256K; > > 3) This particular file extent layout is durably persisted, so that the > existing superblock persisted on disk points to a subvolume root where > the file has that exact file extent layout and state; > > 4) The file is truncated to a smaller size, to an offset lower than the > start offset of its last extent, for example to 800K. The truncate sets > the full sync runtime flag on the inode; > > 6) Fsync the file to log it and clear the full sync runtime flag; > > 7) Clone a region that covers only a hole (implicit hole due to NO_HOLES) > into the file with a destination offset that starts at or beyond the > 256K file extent item we had - for example to offset 1024K; > > 8) Since the clone operation does not find extents in the source range, > we end up in the if branch at the bottom of btrfs_clone() where we > punch a hole for the file range starting at offset 1024K by calling > btrfs_replace_file_extents(). There we end up not setting the full > sync flag on the inode, because we don't know we are being called in > a clone context (and not fallocate's punch hole operation), and > neither do we create an extent map to represent a hole because the > requested range is beyond eof; > > 9) A further fsync to the file will be a fast fsync, since the clone > operation did not set the full sync flag, and therefore it relies on > modified extent maps to correctly log the file layout. But since > it does not find any extent map marking the range from 1024K (the > previous eof) to the new eof, it does not log a file extent item > for that range representing the hole; > > 10) After a power failure no hole for the range starting at 1024K is > punched and we end up exposing stale data from the old 256K extent. > > Turning this into exact steps: > > $ mkfs.btrfs -f -O no-holes /dev/sdi > $ mount /dev/sdi /mnt > > # Create our test file with 3 extents of 256K and a 256K hole at offset > # 256K. The file has a size of 1280K. > $ xfs_io -f -s \ > -c "pwrite -S 0xab -b 256K 0 256K" \ > -c "pwrite -S 0xcd -b 256K 512K 256K" \ > -c "pwrite -S 0xef -b 256K 768K 256K" \ > -c "pwrite -S 0x73 -b 256K 1024K 256K" \ > /mnt/sdi/foobar > > # Make sure it's durably persisted. We want the last committed super > # block to point to this particular file extent layout. > sync > > # Now truncate our file to a smaller size, falling within a position of > # the second extent. This sets the full sync runtime flag on the inode. > # Then fsync the file to log it and clear the full sync flag from the > # inode. The third extent is no longer part of the file and therefore > # it is not logged. > $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar > > # Now do a clone operation that only clones the hole and sets back the > # file size to match the size it had before the truncate operation > # (1280K). > $ xfs_io \ > -c "reflink /mnt/foobar 256K 1024K 256K" \ > -c "fsync" \ > /mnt/foobar > > # File data before power failure: > $ od -A d -t x1 /mnt/foobar > 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > * > 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd > * > 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef > * > 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 1310720 > > <power fail> > > # Mount the fs again to replay the log tree. > $ mount /dev/sdi /mnt > > # File data after power failure: > $ od -A d -t x1 /mnt/foobar > 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > * > 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd > * > 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef > * > 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 > * > 1310720 > > The range from 1024K to 1280K should correspond to a hole but instead it > points to stale data, to the 256K extent that should not exist after the > truncate operation. > > The issue does not exists when not using NO_HOLES, because for that case > we use file extent items to represent holes, these are found and copied > during the loop that iterates over extents at btrfs_clone(), and that > causes btrfs_replace_file_extents() to be called with a non-NULL > extent_info argument and therefore set the full sync runtime flag on the > inode. > > So fix this by making the code that deals with a trailing hole during > cloning, at btrfs_clone(), to set the full sync flag on the inode, if the > range starts at or beyond the current i_size. > > A test case for fstests will follow soon. > > Signed-off-by: Filipe Manana <fdman...@suse.com>
Added to misc-next, thanks.