On Thu, Dec 28, 2023 at 01:19:36PM -0800, Carl E. Thompson wrote: > Hello, there appears to be a bug in bcachefs in which certain changes to > subvolumes and snapshots can result in file contents not being read properly > when copied. Specifically, if a snapshot is created of a subvolume and a file > in either the subvolume or the snapshot is removed or modified, incorrect > data is read from the corresponding unmodified file in the subvolume or > snapshot if that corresponding file is copied. I've reproduced on multiple > systems running rc6 and rc7 but on my systems I can reproduce this 100% of > the time on bcachefs filesystems where the block size is 4k; I cannot > reproduce at all on filesystems with 512 byte blocks (but see below). I've > only tested with 4k and 512 block sizes. None of the other format options > I've tried made a difference in my tests including compression and bucket > size. > > Here is a short example: > --- > [carl@clip test]$ bcachefs subvolume create subvol > > [carl@clip test]$ echo "Test" > subvol/file > > [carl@clip test]$ bcachefs subvolume snapshot subvol snapshot_of_subvol > > [carl@clip test]$ rm subvol/file > > [carl@clip test]$ cat snapshot_of_subvol/file > Test > > [carl@clip test]$ cp snapshot_of_subvol/file file > > [carl@clip test]$ cat file > > [carl@clip test]$ ls -l file > -rw-r--r-- 1 carl carl 5 Dec 28 12:56 file > > [carl@clip test]$ hexdump -C file > 00000000 00 00 00 00 00 |.....| > 00000005 > > [carl@clip test]$ > --- > > - The copied file has the correct length but has zeroes instead of the > correct data > - The problem also occurs if the file in the **snapshot** is removed or > modified and the original file in the subvolume is copied > - In my tests I only see the problem on filesystems with 4k blocks. I've > never been able to reproduce with 512 byte blocks. However someone else on > Reddit says that it definitely happened to them on a filesystem with 512 byte > blocks (but they can't reproduce it now) > - The bug only happens if the file is copied to the same bcachefs filesystem. > It does **not** happen if the file is copied to a different bcachefs > filesystem or a different filesystem entirely > - The bug does **not** occur if the cp command is given the `--reflink=never` > option > - Discussion can be found on this Reddit thread (but not all comments are > correct: > https://www.reddit.com/r/bcachefs/comments/18sbl9z/how_do_you_restore_a_file_from_a_snapshot/ > > In my opinion this is a severe issue because it could lead to data loss if > users rely on copied files having the correct contents.
Fix is in the -testing branch, will be making it to the appropriate places shortly: commit 05b11fb627bcd71cc19e79cdeabfa1b514bf8acf Author: Kent Overstreet <[email protected]> Date: Fri Dec 29 13:39:07 2023 -0500 bcachefs: Fix extents iteration + snapshots interaction peek_upto() checks against the end position and bails out before FILTER_SNAPSHOTS checks; this is because if we end up at a different inode number than the original search key none of the keys we see might be visibile in the current snapshot - we might be looking at inode in a completely different subvolume. But this is broken, because when we're iterating over extents we're checking against the extent start position to decide when to bail out, and the extent start position isn't monotonically increasing until after we've run FILTER_SNAPSHOTS. Fix this by adding a simple inode number check where the old bailout check was, and moving the main check to the correct position. Signed-off-by: Kent Overstreet <[email protected]> diff --git a/fs/bcachefs/btree_iter.c b/fs/bcachefs/btree_iter.c index 732c8bb3f803..7e5c797cfaf2 100644 --- a/fs/bcachefs/btree_iter.c +++ b/fs/bcachefs/btree_iter.c @@ -2145,18 +2145,16 @@ struct bkey_s_c bch2_btree_iter_peek_upto(struct btree_iter *iter, struct bpos e goto out_no_locked; /* - * iter->pos should be mononotically increasing, and always be - * equal to the key we just returned - except extents can - * straddle iter->pos: + * We need to check against @end before FILTER_SNAPSHOTS because + * if we get to a different inode that requested we might be + * seeing keys for a different snapshot tree that will all be + * filtered out. + * + * But we can't do the full check here, because bkey_start_pos() + * isn't monotonically increasing before FILTER_SNAPSHOTS, and + * that's what we check against in extents mode: */ - if (!(iter->flags & BTREE_ITER_IS_EXTENTS)) - iter_pos = k.k->p; - else - iter_pos = bkey_max(iter->pos, bkey_start_pos(k.k)); - - if (unlikely(!(iter->flags & BTREE_ITER_IS_EXTENTS) - ? bkey_gt(iter_pos, end) - : bkey_ge(iter_pos, end))) + if (k.k->p.inode > end.inode) goto end; if (iter->update_path && @@ -2215,6 +2213,21 @@ struct bkey_s_c bch2_btree_iter_peek_upto(struct btree_iter *iter, struct bpos e continue; } + /* + * iter->pos should be mononotically increasing, and always be + * equal to the key we just returned - except extents can + * straddle iter->pos: + */ + if (!(iter->flags & BTREE_ITER_IS_EXTENTS)) + iter_pos = k.k->p; + else + iter_pos = bkey_max(iter->pos, bkey_start_pos(k.k)); + + if (unlikely(!(iter->flags & BTREE_ITER_IS_EXTENTS) + ? bkey_gt(iter_pos, end) + : bkey_ge(iter_pos, end))) + goto end; + break; }
