Re: [PATCH v2] Btrfs: fix file data corruption after cloning a range and fsync

David Sterba Thu, 19 Jul 2018 06:36:03 -0700

On Thu, Jul 12, 2018 at 01:36:43AM +0100, [email protected] wrote:
> From: Filipe Manana <[email protected]>
> 
> When we clone a range into a file we can end up dropping existing
> extent maps (or trimming them) and replacing them with new ones if the
> range to be cloned overlaps with a range in the destination inode.
> When that happens we add the new extent maps to the  list of modified
> extents in the inode's extent map tree, so that a "fast" fsync (the flag
> BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
> and log corresponding extent items. However, at the end of range cloning
> operation we do truncate all the pages in the affected range (in order to
> ensure future reads will not get stale data). Sometimes this truncation
> will release the corresponding extent maps besides the pages from the page
> cache. If this happens, then a "fast" fsync operation will miss logging
> some extent items, because it relies exclusively on the extent maps being
> present in the inode's extent tree, leading to data loss/corruption if
> the fsync ends up using the same transaction used by the clone operation
> (that transaction was not committed in the meanwhile). An extent map is
> released through the callback btrfs_invalidatepage(), which gets called by
> truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
> later ends up calling try_release_extent_mapping() which will release the
> extent map if some conditions are met, like the file size being greater
> than 16Mb, gfp flags allow blocking and the range not being locked (which
> is the case during the clone operation) nor being the extent map flagged
> as pinned (also the case for cloning).
> 
> The following example, turned into a test for fstests, reproduces the
> issue:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
>   $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
> 
>   $ xfs_io -c "fsync" /mnt/bar
>   # reflink destination offset corresponds to the size of file bar,
>   # 2728Kb minus 4Kb.
>   $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
>   $ xfs_io -c "fsync" /mnt/bar
> 
>   $ md5sum /mnt/bar
>   95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar
> 
>   <power fail>
> 
>   $ mount /dev/sdb /mnt
>   $ md5sum /mnt/bar
>   207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
>   # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
>   # power failure
> 
> In the above example, the destination offset of the clone operation
> corresponds to the size of the "bar" file minus 4Kb. So during the clone
> operation, the extent map covering the range from 2572Kb to 2728Kb gets
> trimmed so that it ends at offset 2724Kb, and a new extent map covering
> the range from 2724Kb to 11724Kb is created. So at the end of the clone
> operation when we ask to truncate the pages in the range from 2724Kb to
> 2724Kb + 15908Kb, the page invalidation callback ends up removing the new
> extent map (through try_release_extent_mapping()) when the page at offset
> 2724Kb is passed to that callback.
> 
> Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
> map is removed at try_release_extent_mapping(), forcing the next fsync to
> search for modified extents in the fs/subvolume tree instead of relying on
> the presence of extent maps in memory. This way we can continue doing a
> "fast" fsync if the destination range of a clone operation does not
> overlap with an existing range or if any of the criteria necessary to
> remove an extent map at try_release_extent_mapping() is not met (file
> size not bigger then 16Mb or gfp flags do not allow blocking).
> 
> CC: [email protected] # 3.16+
> Signed-off-by: Filipe Manana <[email protected]>


I've added this to misc-next and will forward it to 4.18 as the type of
fix qualifies for a late rc state, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] Btrfs: fix file data corruption after cloning a range and fsync

Reply via email to