On Thu, Jul 12, 2018 at 01:36:43AM +0100, [email protected] wrote: > From: Filipe Manana <[email protected]> > > When we clone a range into a file we can end up dropping existing > extent maps (or trimming them) and replacing them with new ones if the > range to be cloned overlaps with a range in the destination inode. > When that happens we add the new extent maps to the list of modified > extents in the inode's extent map tree, so that a "fast" fsync (the flag > BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps > and log corresponding extent items. However, at the end of range cloning > operation we do truncate all the pages in the affected range (in order to > ensure future reads will not get stale data). Sometimes this truncation > will release the corresponding extent maps besides the pages from the page > cache. If this happens, then a "fast" fsync operation will miss logging > some extent items, because it relies exclusively on the extent maps being > present in the inode's extent tree, leading to data loss/corruption if > the fsync ends up using the same transaction used by the clone operation > (that transaction was not committed in the meanwhile). An extent map is > released through the callback btrfs_invalidatepage(), which gets called by > truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The > later ends up calling try_release_extent_mapping() which will release the > extent map if some conditions are met, like the file size being greater > than 16Mb, gfp flags allow blocking and the range not being locked (which > is the case during the clone operation) nor being the extent map flagged > as pinned (also the case for cloning). > > The following example, turned into a test for fstests, reproduces the > issue: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo > $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar > > $ xfs_io -c "fsync" /mnt/bar > # reflink destination offset corresponds to the size of file bar, > # 2728Kb minus 4Kb. > $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar > $ xfs_io -c "fsync" /mnt/bar > > $ md5sum /mnt/bar > 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar > > <power fail> > > $ mount /dev/sdb /mnt > $ md5sum /mnt/bar > 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar > # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the > # power failure > > In the above example, the destination offset of the clone operation > corresponds to the size of the "bar" file minus 4Kb. So during the clone > operation, the extent map covering the range from 2572Kb to 2728Kb gets > trimmed so that it ends at offset 2724Kb, and a new extent map covering > the range from 2724Kb to 11724Kb is created. So at the end of the clone > operation when we ask to truncate the pages in the range from 2724Kb to > 2724Kb + 15908Kb, the page invalidation callback ends up removing the new > extent map (through try_release_extent_mapping()) when the page at offset > 2724Kb is passed to that callback. > > Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent > map is removed at try_release_extent_mapping(), forcing the next fsync to > search for modified extents in the fs/subvolume tree instead of relying on > the presence of extent maps in memory. This way we can continue doing a > "fast" fsync if the destination range of a clone operation does not > overlap with an existing range or if any of the criteria necessary to > remove an extent map at try_release_extent_mapping() is not met (file > size not bigger then 16Mb or gfp flags do not allow blocking). > > CC: [email protected] # 3.16+ > Signed-off-by: Filipe Manana <[email protected]>
I've added this to misc-next and will forward it to 4.18 as the type of fix qualifies for a late rc state, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
