Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate

Chris Mason Mon, 27 Oct 2014 12:57:54 -0700

On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana <fdman...@suse.com>wrote:

If right after starting the snapshot creation ioctl we perform awrite against afile followed by a truncate, with both operations increasing thefile's size, wecan get a snapshot tree that reflects a state of the sourcesubvolume's tree wherethe file truncation happened but the write operation didn't. Thisleaves a gapbetween 2 file extent items of the inode, which makes btrfs' fsckcomplain about it.
For example, if we perform the following file operations:
    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar
and the snapshot creation ioctl was just called before the secondwrite, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K +32770, 4096)[for which there's no file extent item covering it. This is becausethe file writeand file truncate operations happened both right after the snapshotcreation ioctlcalled btrfs_start_delalloc_inodes(), which means we didn't start andwait for theordered extent that matches the write and, in btrfs_setsize(), wewere able to callbtrfs_cont_expand() before being able to commit the currenttransaction in thesnapshot creation ioctl. So this made it possibe to insert the holefile extentitem in the source subvolume (which represents the region added bythe truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like thefollowing:
    "root 331 inode 257 errors 100, file extent discount"
From a user perspective, the expectation when a snapshot is createdwhile thosefile operations are being performed is that the snapshot will have afile that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and thetruncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0d41741..c28b78f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode,struct iattr *attr)
        }

        if (newsize > oldsize) {
+               ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+               if (ret)
+                       return ret;

Expanding truncates aren't my favorite operation, but we don't wantthem to imply fsync. I'm holding off on this one while I work out therest of the vacation backlog ;)


-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate

Reply via email to