Re: [PATCH] btrfs: copy fsid to super_block s_uuid
Hi Darrick, thanks for commenting.

>> + memcpy(sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
>
> uuid_copy()?

That requires a larger migration to uuid_t; IMO it can be done all together in a separate patch. As an experiment, starting with struct btrfs_fs_info.fsid to check its footprint, I just renamed fsid to fs_id and compiled. It reports 73 "has no member named 'fsid'" errors. So it looks like redefining u8 fsid[] to uuid_t fsid, and then updating all of its users, needs to be simplified first. Any suggestions?

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: verify_dir_item fails in replay_xattr_deletes
From: Su Yue

In replay_xattr_deletes(), the argument @slot of verify_dir_item() should be the loop variable @i instead of path->slots[0]. The bug causes failures of generic/066 and shared/002 in xfstests.

dmesg:
[12507.810781] BTRFS critical (device dm-0): invalid dir item name len: 10
[12507.811185] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
[12507.811928] BTRFS error (device dm-0): cleaner transaction attach returned -30
[12507.821020] BTRFS error (device dm-0): open_ctree failed
[12508.131526] BTRFS info (device dm-0): disk space caching is enabled
[12508.132145] BTRFS info (device dm-0): has skinny extents
[12508.136265] BTRFS critical (device dm-0): invalid dir item name len: 10
[12508.136678] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
[12508.137501] BTRFS error (device dm-0): cleaner transaction attach returned -30
[12508.147982] BTRFS error (device dm-0): open_ctree failed

Signed-off-by: Su Yue
---
 fs/btrfs/tree-log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f20ef211a73d..3a11ae63676e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2153,8 +2153,7 @@ static int replay_xattr_deletes(struct btrfs_trans_handle *trans,
 		u32 this_len = sizeof(*di) + name_len + data_len;
 		char *name;

-		ret = verify_dir_item(fs_info, path->nodes[0],
-				      path->slots[0], di);
+		ret = verify_dir_item(fs_info, path->nodes[0], i, di);
 		if (ret) {
 			ret = -EIO;
 			goto out;
--
2.13.3
Re: Massive loss of disk space
Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as excerpted:

> I think I _might_ understand what's going on here. Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file? I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.

If I've interpreted previous discussions on this list correctly (not being a dev, only a btrfs user, sysadmin, and list regular)... That's exactly what it's doing, and it's _intended_ behavior.

The reasoning is something like this: fallocate is supposed to pre-allocate some space, with the intent that writes into that space won't fail, because the space is already allocated. For an existing file with some data already in it, ext4 and xfs do that counting the existing space. But btrfs is copy-on-write, meaning it's going to have to write the new data to a different location than the existing data, and it may well not free up the existing allocation (if even a single 4 KiB block of the existing allocation remains unwritten, it will hold down the entire previous allocation, which isn't released until *none* of it is still in use -- in normal usage, "in use" can also be due to old snapshots or other reflinks to the same extent, though in these test cases it's not). So in order to provide the writes-to-preallocated-space-shouldn't-ENOSPC guarantee, btrfs can't count currently used space as part of the fallocate.
The different behavior is entirely due to btrfs being COW, and thus a choice having to be made: do we worst-case reserve at fallocate time for writes over currently used data that will have to be COWed elsewhere (possibly without freeing the existing extents, because something still references them), or do we risk ENOSPCing on a write to a previously fallocated area? The choice was to worst-case-reserve and take the ENOSPC risk at fallocate time, so the write into that fallocated space can then proceed without the ENOSPC risk that COW would otherwise imply. Make sense, or is my understanding a horrible misunderstanding? =:^)

So if you're actually only appending, fallocate the /additional/ space, not the /entire/ space, and you'll get what you need. But if you're potentially overwriting what's there already, better fallocate the entire space, which triggers the btrfs worst-case allocation behavior you see, in order to guarantee it won't ENOSPC during the actual write.

Of course the only time the behavior actually differs is with COW, but then there's a BIG difference -- and that BIG difference has a GOOD BIG reason! =:^) That difference will certainly necessitate some relearning of the /correct/ way to do it, for devs who were doing it the COW-worst-case way all along even when they didn't need to, because it happened not to make a difference on what they were testing on, which happened not to be COW...

Reminds me of the way newer versions of gcc, or trying to build with clang as well, tend to trigger relearning: newer versions are stricter in order to allow better optimization, and other implementations are simply different in what they're strict about, /because/ they're a different implementation. Well, btrfs is stricter... because it's a different implementation that /has/ to be stricter... due to COW.

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS error: bad tree block start 0 623771648
Roman Mamedov posted on Tue, 01 Aug 2017 11:08:05 +0500 as excerpted:

> On Sun, 30 Jul 2017 18:14:35 +0200 "marcel.cochem" wrote:
>
>> I am pretty sure that not all data is lost, as I can grep through the
>> 100 GB SSD partition. But my question is whether there is a tool to rescue
>> all (intact) data and maybe have only a few corrupt files which can't
>> be recovered.
>
> There is such a tool, see
> https://btrfs.wiki.kernel.org/index.php/Restore

I was going to suggest that too... and even started a reply to do so... upon which I read a bit closer and saw he'd actually tried restore already...

And before you suggest it, he tried btrfs-find-root as well, and it didn't work either, so he can't do the advanced/technical mode of restore, feeding it addresses from btrfs-find-root, either. =:^( It's in the post...

So unfortunately he's pretty much left with manual hacking and scraping, and that's at a level beyond what I at least am able to help him with...

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Btrfs + compression = slow performance and high cpu usage
[ ... ]

> This is the "storage for beginners" version; what happens in
> practice however depends a lot on the specific workload profile
> (typical read/write sizes, latencies and rates) and on the caching and
> queueing algorithms in both Linux and the HA firmware.

To add a bit of slightly more advanced discussion: the main reason for larger strips ("chunk size") is to avoid the huge latencies of disk rotation when using unsynchronized disk drives, as detailed here:

http://www.sabi.co.uk/blog/12-thr.html?120310#120310

That relates only weakly to Btrfs.
Re: BTRFS error: bad tree block start 0 623771648
On Tue, Aug 01, 2017 at 11:04:10AM +0500, Roman Mamedov wrote:
> On Mon, 31 Jul 2017 11:12:01 -0700 Liu Bo wrote:
>
>> Superblock and chunk tree root are OK; it looks like the header part of
>> the tree root is now all-zero, but I'm unable to think of a btrfs bug
>> which can lead to that (if there is one, it is a serious one)
>
> I see that the FS is being mounted with "discard". So maybe it was a TRIM gone
> bad (wrong location, or in a wrong sequence).

Having checked the discard path in btrfs, it looks OK to me; more likely this was caused by problems in the underlying stack.

Thanks,

-liubo

> Generally it appears to be not recommended to use "discard" by now (because of
> its performance impact, and maybe possible issues like this); instead schedule
> a call to "fstrim" once a day or so, and/or on boot-up.
>
>> on ssd-like disks, by default there is only one copy of the metadata.
>
> Time and time again, the default of "single" metadata for SSD is a terrible
> idea. Most likely DUP metadata would have saved the FS in this case.
>
> --
> With respect,
> Roman
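Roman's suggestion of scheduling a batched trim instead of mounting with "discard" can look like this (the mount point is illustrative):

```shell
# One-off batched TRIM; -v prints how much was discarded.
fstrim -v /media/raid1

# Or enable the periodic timer shipped with util-linux, which runs
# fstrim on mounted filesystems that support discard:
systemctl enable --now fstrim.timer
```

Either way, the "discard" mount option can then be dropped from fstab.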
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 23:00, Christoph Anton Mitterer wrote:
> Hi.
>
> Stupid question:
> Would the write hole be closed already, if parity was checksummed?

No. The write hole problem is due to a combination of two things:
a) misalignment between parity and data (i.e. an unclean shutdown)
b) loss of a disk (i.e. disk failure)

Note: the write hole problem happens even if these two events are not consecutive.

After the disk failure, when you need to read data from the broken disk, you need the parity to compute the data. But if the parity is misaligned, wrong data is returned. The data checksums are sufficient to detect that wrong data is returned; a parity checksum is not needed for that. And in any case, neither can avoid the problem.

> Cheers,
> Chris.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 19:24, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>> Hi Liu,
>>
>> On 2017-08-01 18:14, Liu Bo wrote:
>>> This aims to fix the write hole issue on btrfs raid5/6 setups by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after an unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>
>> Would it be possible to have more information?
>> - What is logged? Data, parity, or data + parity?
>
> Patch 5 has more details (sorry for not making that clear in the
> cover letter).
>
> Both data and parity are logged, so that while replaying the journal
> everything is written to whichever disk it should be written to.

Is it correct to read this as: all data is written twice? Or are only the stripes involved in a RMW cycle logged (i.e. if a stripe is fully written, the log is bypassed)?

>> - In the past I thought it would be sufficient to log only the stripe
>> positions involved in a RMW cycle, and then start a scrub on those stripes in
>> case of an unclean shutdown: do you think that is feasible?
>
> An unclean shutdown causes inconsistency between data and parity, so
> scrub won't help, as it's not able to tell which one (data or parity)
> is valid.

Scrub compares data against its checksum, so it does know whether the data is correct. If no disk is lost, a scrub pass is sufficient (and needed) to rebuild the parity/data. The problem arises when, after "an unclean shutdown", a disk failure happens. But those are *two* distinct failures. Together they break the btrfs raid5 redundancy; but if you run a scrub between the two failures, the btrfs raid5 redundancy is still effective.

> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> With datacow, we don't do overwrite, but the following situation may
> happen. Say we have a raid5 setup with 3 disks and a stripe length of 64K:
>
> 1) write 64K --> now the raid layout is
>    [64K data + 64K random + 64K parity]
> 2) write another 64K --> now the raid layout after RMW is
>    [64K 1)'s data + 64K 2)'s data + 64K new parity]
>
> If an unclean shutdown occurs before 2) finishes, then the parity may be
> corrupted, and then 1)'s data may be recovered wrongly if the disk
> which holds 1)'s data is offline.
>
>> - Does this journal disk also host other btrfs logs?
>
> No, purely data/parity and some associated metadata.
>
> Thanks,
>
> -liubo
>
>>> The idea and the code are similar to the write-through mode of md
>>> raid5-cache, so ppl (partial parity log) is also feasible to implement.
>>> (If you've been familiar with md, you may find this patch set is
>>> boring to read...)
>>>
>>> Patches 1-3 are about adding a log disk, patches 5-8 are the main part
>>> of the implementation, and the rest are improvements and bugfixes,
>>> eg. readahead for recovery, checksum.
>>>
>>> Two btrfs-progs patches are required to play with this patch set: one
>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
>>> option '-L', the other is to teach 'btrfs-show-super' to show
>>> %journal_tail.
>>>
>>> This is currently based on 4.12-rc3.
>>>
>>> The patch set is tagged with RFC, and comments are always welcome,
>>> thanks.
>>>
>>> Known limitations:
>>> - Deleting a log device is not implemented yet.
>>>
>>> Liu Bo (14):
>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>>>   Btrfs: raid56: do not allocate chunk on raid56 log
>>>   Btrfs: raid56: detect raid56 log on mount
>>>   Btrfs: raid56: add verbose debug
>>>   Btrfs: raid56: add stripe log for raid5/6
>>>   Btrfs: raid56: add reclaim support
>>>   Btrfs: raid56: load r5log
>>>   Btrfs: raid56: log recovery
>>>   Btrfs: raid56: add readahead for recovery
>>>   Btrfs: raid56: use the readahead helper to get page
>>>   Btrfs: raid56: add csum support
>>>   Btrfs: raid56: fix error handling while adding a log device
>>>   Btrfs: raid56: initialize raid5/6 log after adding it
>>>   Btrfs: raid56: maintain IO order on raid5/6 log
>>>
>>>  fs/btrfs/ctree.h                |   16 +-
>>>  fs/btrfs/disk-io.c              |   16 +
>>>  fs/btrfs/ioctl.c                |   48 +-
>>>  fs/btrfs/raid56.c               | 1429 ++-
>>>  fs/btrfs/raid56.h               |   82 +++
>>>  fs/btrfs/transaction.c          |    2 +
>>>  fs/btrfs/volumes.c              |   56 +-
>>>  fs/btrfs/volumes.h              |    7 +-
>>>  include/uapi/linux/btrfs.h      |    3 +
>>>  include/uapi/linux/btrfs_tree.h |    4 +
>>>
>>> 10 files changed, 1487 insertions(+), 176 deletions(-)
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi.

Stupid question: would the write hole be closed already if parity was checksummed?

Cheers,
Chris.
Re: Slow mounting raid1
2017-08-01 23:21 GMT+03:00 Leonidas Spyropoulos:
> On 01/08/17, E V wrote:
>> In general I think btrfs takes time proportional to the size of your
>> metadata to mount. Bigger and/or fragmented metadata leads to longer
>> mount times. My big backup fs with >300GB of metadata takes over
>> 20 minutes to mount, and that's with the space tree, which is
>> significantly faster than space cache v1.
>
> Hmm, my raid1 doesn't seem near full or to have significant metadata,
> so I don't think I'm in that case:
> # btrfs fi show /media/raid1/
> Label: 'raid1'  uuid: c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d
>         Total devices 2 FS bytes used 516.18GiB
>         devid 1 size 931.51GiB used 518.03GiB path /dev/sdd
>         devid 2 size 931.51GiB used 518.03GiB path /dev/sde
>
> # btrfs fi df /media/raid1/
> Data, RAID1: total=513.00GiB, used=512.21GiB
> System, RAID1: total=32.00MiB, used=112.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.97GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> I tried space_cache=v2 just to see if it would make any difference,
> but nothing changed:
> # cat /etc/fstab | grep raid1
> UUID=c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d /media/raid1 btrfs rw,noatime,compress=lzo,space_cache=v2 0 0
> # time umount /media/raid1 && time mount /media/raid1/
>
> real    0m0.807s
> user    0m0.237s
> sys     0m0.441s
>
> real    0m5.494s
> user    0m0.618s
> sys     0m0.116s
>
> I did a couple of rebalances on metadata and data and it improved a bit:
> # btrfs balance start -musage=100 /media/raid1/
> # btrfs balance start -dusage=10 /media/raid1/
> [.. incremental dusage 10 -> 95]
> # btrfs balance start -dusage=95 /media/raid1
>
> Down to 3.7 sec:
> # time umount /media/raid1 && time mount /media/raid1/
>
> real    0m0.807s
> user    0m0.237s
> sys     0m0.441s
>
> real    0m3.790s
> user    0m0.430s
> sys     0m0.031s
>
> I think maybe the next step is to disable compression if I want to mount
> it faster. Is it normal for BTRFS that performance degrades after
> some time?
> Regards,
>
> --
> Leonidas Spyropoulos
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is it such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing on usenet and in e-mail?

AFAIK, for space_cache=v2 you need to do something like:

btrfs check --clear-space-cache v1 /dev/sdd
mount -o space_cache=v2 /dev/sdd

The first mount will be very slow, because it requires a rebuild of the space cache.

Thanks.

--
Have a nice day,
Timofey.
Re: Btrfs incremental send | receive fails with Error: File not found
Then the following problem is directly related to that:
https://unix.stackexchange.com/questions/377914/how-to-test-if-two-btrfs-snapshots-are-identical

Is that a bug or a feature?

2017-08-01 23:33 GMT+03:00 A L:
> On 8/1/2017 10:24 PM, Cerem Cem ASLAN wrote:
>> What does that mean? Can't we replicate the same snapshot with `btrfs send |
>> btrfs receive` multiple times, because it will have a "Received UUID" after
>> the first `btrfs receive`?
>
> You will need to make a new read-write snapshot of the received volume to
> fix it. Any snapshots created from the received subvolume can't be used for
> send-receive again, afaik.
>
> # btrfs subvolume snapshot subvolume.received subvolume
Re: Btrfs incremental send | receive fails with Error: File not found
On 8/1/2017 10:24 PM, Cerem Cem ASLAN wrote:
> What does that mean? Can't we replicate the same snapshot with `btrfs send |
> btrfs receive` multiple times, because it will have a "Received UUID" after
> the first `btrfs receive`?

You will need to make a new read-write snapshot of the received volume to fix it. Any snapshots created from the received subvolume can't be used for send-receive again, afaik.

# btrfs subvolume snapshot subvolume.received subvolume
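A possible workflow following this advice (all paths illustrative): check whether a subvolume carries a Received UUID before using it as a send source, and break the lineage with fresh snapshots if it does:

```shell
# If "Received UUID" shows a value rather than "-", the subvolume's
# send/receive lineage is tainted.
btrfs subvolume show /mnt/pool/subvolume | grep -i 'received uuid'

# Break the lineage: make a fresh read-write snapshot, then a
# read-only snapshot of that for use as the send source.
btrfs subvolume snapshot /mnt/pool/subvolume.received /mnt/pool/subvolume
btrfs subvolume snapshot -r /mnt/pool/subvolume /mnt/pool/subvolume.ro
btrfs send /mnt/pool/subvolume.ro | btrfs receive /mnt/backup/
```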
cant removed a corrupted dir and some files, btrfs-check crashes
Hi all...

I've been using btrfs for years without any major issues (OK, not true, but it was always my own fault). This time around I was testing the new BFQ in 4.12 (hint: don't use it for heavy I/O), and the laptop froze solid. So far so good; the thing is, once I rebooted I had a corrupted qcow2 image and a dir with some files that can't be removed or fixed. The qcow2 I fixed easily; I understand that using compression with sparse files might result in corruption, so no btrfs fault here. Now, the "unkillable" dir is another story. Every time I run btrfs check I get:

root@kerberos:~# btrfs check /dev/sda3
Checking filesystem on /dev/sda3
UUID: 619d4eb2-2c94-438b-9b9e-182ed969ad61
checking extents
checking free space cache
checking fs roots
root 258 inode 51958 errors 100, file extent discount
Found file extent holes:
	start: 8192, len: 499712
root 258 inode 4522616 errors 200, dir isize wrong
	unresolved ref dir 4522616 index 3 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPoLQZgemwYbF7SG.ZpQvWlk-- filetype 1 errors 2, no dir index
root 258 inode 6036422 errors 1, no inode item
	unresolved ref dir 4522616 index 46227 namelen 104 name ECRYPTFS_FNEK_ENCRYPTED.FXY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPXtMCQ9BwG3JHBHoMOf9hI0EvP6p11X8OCd8Iew1bYMQ- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 6036423 errors 1, no inode item
	unresolved ref dir 4522616 index 46229 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPCSCfQYa2WG4o8T93CrHv0k-- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 8792178 errors 1, no inode item
	unresolved ref dir 4522616 index 133165 namelen 84 name ECRYPTFS_FNEK_ENCRYPTED.FWY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPCSCfQYa2WG4o8T93CrHv0k-- filetype 1 errors 5, no dir item, no inode ref
root 258 inode 8792183 errors 1, no inode item
	unresolved ref dir 4522616 index 133167 namelen 104 name ECRYPTFS_FNEK_ENCRYPTED.FXY9yE0bBJvSm-S7O71XRP3G1sxpTJjXp7LPXtMCQ9BwG3JHBHoMOf9hI0EvP6p11X8OCd8Iew1bYMQ- filetype 1 errors 5, no dir item, no inode ref
ERROR: errors found in fs roots
found 109814329344 bytes used, error(s) found
total csum bytes: 106513584
total tree bytes: 672759808
total fs tree bytes: 455245824
total extent tree bytes: 82968576
btree space waste bytes: 139453318
file data blocks allocated: 581567266816
 referenced 103288545280

As you may have guessed, it's a btrfs + ecryptfs mount. I tried to delete the inodes, but the system can't "find" them. When I try btrfs check --repair, I get:

root@kerberos:~# btrfs check --repair /dev/sda3
enabling repair mode
Checking filesystem on /dev/sda3
UUID: 619d4eb2-2c94-438b-9b9e-182ed969ad61
checking extents
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
Unable to find block group for 0
extent-tree.c:287: find_search_start: Warning: assertion `1` failed, value 1
btrfs(+0x20c38)[0x936fcaac38]
btrfs(btrfs_reserve_extent+0x585)[0x936fcaee61]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
btrfs(+0x5557b)[0x936fcdf57b]
btrfs(cmd_check+0x1309)[0x936fce09ac]
btrfs(main+0x142)[0x936fca20d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f956c3153f1]
btrfs(_start+0x2a)[0x936fca211a]
extent-tree.c:2694: btrfs_reserve_extent: BUG_ON `ret` triggered, value -28
btrfs(+0x20c38)[0x936fcaac38]
btrfs(+0x20ca8)[0x936fcaaca8]
btrfs(+0x20cbb)[0x936fcaacbb]
btrfs(btrfs_reserve_extent+0x751)[0x936fcaf02d]
btrfs(btrfs_alloc_free_block+0x63)[0x936fcaf229]
btrfs(__btrfs_cow_block+0xfe)[0x936fca30b9]
btrfs(btrfs_cow_block+0xc4)[0x936fca366f]
btrfs(+0x1d7ca)[0x936fca77ca]
btrfs(btrfs_commit_transaction+0xac)[0x936fca8f4a]
Re: Btrfs incremental send | receive fails with Error: File not found
What does that mean? Can't we replicate the same snapshot with `btrfs send | btrfs receive` multiple times, because it will have a "Received UUID" after the first `btrfs receive`?

2017-08-01 15:54 GMT+03:00 A L:
> OK. The problem was that the original subvolume had a "Received UUID". This
> caused all subsequent snapshots to have the same Received UUID, which messes
> up btrfs send | receive. Of course this means I must have used btrfs send |
> receive to create that subvolume and then turned it r/w at some point,
> though I cannot remember ever doing this.
>
> Perhaps a clear notice "WARNING: make sure that the source subvolume does
> not have a Received UUID" on the wiki would be helpful? Both on
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup and on
> https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-property
>
> Regards,
> A
>
> On 7/28/2017 9:32 PM, Hermann Schwärzler wrote:
>> Hi,
>>
>> for me it looks like those snapshots are not read-only. But as far as I
>> know, for use with send they have to be.
>>
>> At least
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping
>> states "We will need to create a read-only snapshot..."
>>
>> I am using send/receive (with read-only snapshots) on a regular basis and
>> never had a problem like yours.
>>
>> What are the commands you use to create your snapshots?
>>
>> Greetings,
>> Hermann
>>
>> On 07/28/2017 07:26 PM, A L wrote:
>>> I often hit the following error when doing incremental btrfs send-receive:
>>> Btrfs incremental send | receive fails with Error: File not found
>>>
>>> Sometimes I can do two or three incremental snapshots, but then the same
>>> error (with a different file) happens again. It seems that the files were
>>> changed or replaced between snapshots, which is causing the problems for
>>> send-receive. I have tried to delete all snapshots and started over, but
>>> the problem comes back, so I think it must be a bug.
>>>
>>> The source volume is: /mnt/storagePool (with RAID1 profile)
>>> with subvolume: volume/userData
>>> Backup disk is: /media/usb-backup (external USB disk)
>> [...]
Re: Slow mounting raid1
On 01/08/17, E V wrote:
> In general I think btrfs takes time proportional to the size of your
> metadata to mount. Bigger and/or fragmented metadata leads to longer
> mount times. My big backup fs with >300GB of metadata takes over
> 20 minutes to mount, and that's with the space tree, which is
> significantly faster than space cache v1.

Hmm, my raid1 doesn't seem near full or to have significant metadata, so I don't think I'm in that case:

# btrfs fi show /media/raid1/
Label: 'raid1'  uuid: c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d
        Total devices 2 FS bytes used 516.18GiB
        devid 1 size 931.51GiB used 518.03GiB path /dev/sdd
        devid 2 size 931.51GiB used 518.03GiB path /dev/sde

# btrfs fi df /media/raid1/
Data, RAID1: total=513.00GiB, used=512.21GiB
System, RAID1: total=32.00MiB, used=112.00KiB
Metadata, RAID1: total=5.00GiB, used=3.97GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I tried space_cache=v2 just to see if it would make any difference, but nothing changed:

# cat /etc/fstab | grep raid1
UUID=c9db91e6-0ba8-4ae6-b471-8fd4ff7ee72d /media/raid1 btrfs rw,noatime,compress=lzo,space_cache=v2 0 0
# time umount /media/raid1 && time mount /media/raid1/

real    0m0.807s
user    0m0.237s
sys     0m0.441s

real    0m5.494s
user    0m0.618s
sys     0m0.116s

I did a couple of rebalances on metadata and data and it improved a bit:

# btrfs balance start -musage=100 /media/raid1/
# btrfs balance start -dusage=10 /media/raid1/
[.. incremental dusage 10 -> 95]
# btrfs balance start -dusage=95 /media/raid1

Down to 3.7 sec:

# time umount /media/raid1 && time mount /media/raid1/

real    0m0.807s
user    0m0.237s
sys     0m0.441s

real    0m3.790s
user    0m0.430s
sys     0m0.031s

I think maybe the next step is to disable compression if I want it to mount faster. Is it normal for BTRFS that performance degrades after some time?

Regards,

--
Leonidas Spyropoulos

A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Re: Btrfs + compression = slow performance and high cpu usage
>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No, I don't have persistent cache. Only the 512 MB cache
> on board of the controller, which is battery-backed (BBU).

If it is a persistent cache, which can be battery-backed (as I wrote, but it seems that you don't have much time to read replies), then the size of the write, 128KiB or not, should not matter much: the write will be reported complete when it hits the persistent cache (whichever technology it uses), and then the HA firmware will spill write-cached data to the disks using the optimal operation width. Unless the 3ware firmware is really terrible (and depending on model and vintage it can be amazingly terrible), or the battery is no longer recharging, in which case the host adapter switches to write-through.

That you see very different rates between uncompressed and compressed writes, where the main difference is the limitation on the segment size, seems to indicate that compressed writes involve a lot of RMW, that is, sub-stripe updates. As I mentioned already, it would be interesting to retry 'dd' with different 'bs' values, without compression and with 'sync' (or 'direct', which only makes sense without compression).

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned the presence of a BBU cache either, which is equivalent, even though one of your previous messages (which I try to read carefully) contained these lines:

  Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
  Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

So perhaps someone else would have checked long ago the status of the BBU, and whether the "No Write Cache if Bad BBU" case has happened.
If the BBU is still working and the policy is still "WriteBack" then things are stranger still.

> I was also under impression, that in a situation where mostly
> extra large files will be stored on the array, the bigger
> strip size would indeed increase the speed, thus I went
> with the 256 Kb strip size.

That runs counter to this simple story: suppose a program is doing 64KiB IO:

* For *reads*, if there are 4 data drives and the strip size is 16KiB, the 64KiB will be read in parallel from 4 drives. If the strip size is 256KiB then the 64KiB will be read sequentially from just one disk, and 4 successive reads will be read sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much more extreme: with 16KiB strips the 64KiB will be written on a 5-wide RAID5 set in parallel to 5 drives, updating one full stripe without needing RMW. But with 256KiB strips it will partially update the stripe, because the stripe is 1024+256KiB, and it needs to do RMW, and four successive 64KiB writes will need to do that too, even if only one data drive is updated each time. Usually for RAID5 there is an optimization so that only the specific target drive and the parity drive(s) need RMW, but it is still very expensive.

This is the "storage for beginners" version; what happens in practice however depends a lot on the specific workload profile (typical read/write sizes, latencies and rates) and the caching and queueing algorithms in both Linux and the HA firmware.

> Would I be correct in assuming that the RAID strip size of 128
> Kb will be a better choice if one plans to use the BTRFS with
> compression?

That would need to be tested, because it "depends a lot on specific workload profile, caching and queueing algorithms", but my expectation is that the lower the better. Given that you have 4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip size, giving a data stripe size of 96KiB or 192KiB, would be better.
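The strip-size arithmetic above can be put into a tiny model (a sketch with made-up helper names, ignoring caching, queueing and parity-drive traffic; this is not code from btrfs or md):

```python
# Simplified model of how strip size affects a single I/O on RAID5.
# n_data: data drives per stripe (a 5-wide RAID5 set -> 4 data + 1 parity).

def raid5_io(io_kib, strip_kib, n_data):
    stripe_kib = strip_kib * n_data           # data capacity of one full stripe
    full_stripes, partial = divmod(io_kib, stripe_kib)
    # A full-stripe write needs no read-modify-write; a sub-stripe one does.
    rmw_needed = partial != 0
    data_drives_touched = min(io_kib // strip_kib, n_data) or 1
    return {"stripe_kib": stripe_kib, "rmw": rmw_needed,
            "data_drives_touched": data_drives_touched}

# 64KiB I/O with 16KiB strips on 4 data drives: one full stripe, no RMW.
print(raid5_io(64, 16, 4))
# Same I/O with 256KiB strips: sub-stripe update, so the write needs RMW
# and the read is served by a single data drive.
print(raid5_io(64, 256, 4))
```

The model only counts data drives; a real RMW additionally reads and rewrites the parity strip, which is what makes the large-strip case so expensive.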
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs: increase ctx->pos for delayed dir index
From: Josef Bacik

Our dir_context->pos is supposed to hold the next position we're supposed to look at. If we successfully insert a delayed dir index we could end up with a duplicate entry because we don't increase ctx->pos after doing the dir_emit.

Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 8ae409b..19e4ad2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1727,6 +1727,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
 		if (over)
 			return 1;
+		ctx->pos++;
 	}
 	return 0;
 }
-- 
2.7.4
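The off-by-one described in the commit message is easy to see in a toy model of a readdir loop (hypothetical structure, not the kernel code):

```python
# Toy readdir: 'pos' must always point at the NEXT entry to emit.
# If one code path emits an entry but forgets to advance pos, a later
# pass that resumes from 'pos' re-emits the same entry.

entries = {0: "a", 1: "b", 2: "c"}

def readdir(pos, bump_after_emit):
    emitted = []
    while pos in entries:
        emitted.append(entries[pos])
        if bump_after_emit:
            pos += 1
        else:
            break  # buggy path returns without advancing pos
    return emitted, pos

# Correct path: pos ends up past the last emitted entry.
print(readdir(0, True))    # (['a', 'b', 'c'], 3)
# Buggy path: the caller resumes at the same pos and 'a' shows up twice.
out1, pos = readdir(0, False)
out2, _ = readdir(pos, True)
print(out1 + out2)         # ['a', 'a', 'b', 'c']
```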
[PATCH 1/2][v3] btrfs: fix readdir deadlock with pagefault
From: Josef BacikReaddir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Signed-off-by: Josef Bacik --- v2->v3: actually set the filp->private_data properly for ioctl trans. fs/btrfs/ctree.h | 5 +++ fs/btrfs/file.c | 9 - fs/btrfs/inode.c | 107 +-- fs/btrfs/ioctl.c | 22 4 files changed, 109 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5ee9f10..33e942b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1264,6 +1264,11 @@ struct btrfs_root { atomic64_t qgroup_meta_rsv; }; +struct btrfs_file_private { + struct btrfs_trans_handle *trans; + void *filldir_buf; +}; + static inline u32 btrfs_inode_sectorsize(const struct inode *inode) { return btrfs_sb(inode->i_sb)->sectorsize; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 0f102a1..1897c3b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1973,8 +1973,15 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, int btrfs_release_file(struct inode *inode, struct file *filp) { - if (filp->private_data) + struct btrfs_file_private *private = filp->private_data; + + if (private && private->trans) btrfs_ioctl_trans_end(filp); + if (private && private->filldir_buf) + kfree(private->filldir_buf); + kfree(private); + filp->private_data = NULL; + /* * ordered_data_close is set by settattr when we are about to truncate * a file from a non-zero size to a zero size. This tries to diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9a4413a..bbdbeea 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5877,25 +5877,73 @@ unsigned char btrfs_filetype_table[] = { DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK }; +/* + * All this infrastructure exists because dir_emit can fault, and we are holding + * the tree lock when doing readdir. 
For now just allocate a buffer and copy + * our information into that, and then dir_emit from the buffer. This is + * similar to what NFS does, only we don't keep the buffer around in pagecache + * because I'm afraid I'll fuck that up. Long term we need to make filldir do + * copy_to_user_inatomic so we don't have to worry about page faulting under the + * tree lock. + */ +static int btrfs_opendir(struct inode *inode, struct file *file) +{ + struct btrfs_file_private *private; + + private = kzalloc(sizeof(struct btrfs_file_private), GFP_KERNEL); + if (!private) + return -ENOMEM; + private->filldir_buf = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!private->filldir_buf) { + kfree(private); + return -ENOMEM; + } + file->private_data = private; + return 0; +} + +struct dir_entry { + u64 ino; + u64 offset; + unsigned type; + int name_len; +}; + +static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx) +{ + while (entries--) { + struct dir_entry *entry = addr; + char *name = (char *)(entry + 1); + ctx->pos = entry->offset; + if (!dir_emit(ctx, name, entry->name_len, entry->ino, + entry->type)) + return 1; + addr += sizeof(struct dir_entry) + entry->name_len; + ctx->pos++; + } + return 0; +} + static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_file_private *private = file->private_data; struct btrfs_dir_item *di; struct btrfs_key key; struct btrfs_key found_key; struct btrfs_path *path; + void *addr; struct list_head ins_list; struct list_head del_list; int ret; struct extent_buffer *leaf; int slot; - unsigned char d_type; - int over = 0; - char tmp_name[32]; char *name_ptr; int name_len; + int entries = 0; + int total_len = 0; bool put = false; struct btrfs_key location; @@ -5906,12 +5954,14 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) if (!path) 
return -ENOMEM; + addr = private->filldir_buf; path->reada = READA_FORWARD; INIT_LIST_HEAD(&ins_list); INIT_LIST_HEAD(&del_list); put = btrfs_readdir_get_delayed_items(inode, &ins_list, &del_list); +again: key.type = BTRFS_DIR_INDEX_KEY;
Re: Odd fallocate behavior on BTRFS.
On 2017-08-01 15:07, Holger Hoffstätte wrote:
> On 08/01/17 20:15, Holger Hoffstätte wrote:
>> On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
>> [..]
>>> Apparently, if you call fallocate() on a file with an offset of 0 and
>>> a length longer than the length of the file itself, BTRFS will
>>> allocate that exact amount of space, instead of just filling in holes
>>> in the file and allocating space to extend it. If there isn't enough
>>> space on the filesystem for this, then it will fail, even though it
>>> would succeed on ext4, XFS, and F2FS.
>> [..]
>>> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
>>> behavior that should be considered implementation defined? 2. If not,
>>> is my assessment that BTRFS is behaving incorrectly in this case
>>> accurate?
>>
>> IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
>> make it very clear that the expected default behaviour is to extend.
>> I don't think this can be interpreted in any other way than incorrect
>> behaviour on behalf of btrfs.
>>
>> Your script reproduces for me, so that's a start.
>
> Your reproducer should never ENOSPC because it requires exactly 0 new
> bytes to be allocated, yet it also fails with --keep-size.

Unless I'm doing the math wrong, it should require exactly 2 new bytes. 65536 (the block size for dd) times 32768 (the block count for dd) is 2147483648 (2^31), while the fallocate call requests a total size of 2147483650 bytes. It may not need to allocate a new block, but it should definitely be extending the file.

> From a quick look it seems that btrfs_fallocate() unconditionally calls
> btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to
> lazily allocate the necessary extent(s), which goes ENOSPC because that
> size is again the full size of the requested range, not the difference
> between the existing file size and the new range length. But I might be
> misreading things..

As far as I can tell, that is correct.
However, we can't just extend the range, because the existing file might have sparse regions, and those need to have allocations forced too (and based on the code, this will also cause issues any time the fallocate range includes already allocated extents, so I don't think it can be special cased either).
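The extend semantics being discussed are easy to check from userspace; a small sketch using Python's os.posix_fallocate on a throwaway file (this is not the original reproducer script, and it exercises whatever filesystem the temp directory lives on):

```python
import os
import tempfile

# posix_fallocate(fd, offset, len) must ensure [offset, offset+len) is
# allocated, extending the file if offset+len is past EOF -- for bytes
# that are already written it should only need to allocate the difference.

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)       # existing 4096-byte file
    f.flush()
    # Requested range covers the whole file plus 2 extra bytes.
    os.posix_fallocate(f.fileno(), 0, 4098)
    size = os.fstat(f.fileno()).st_size

print(size)                    # 4098: the file was extended by just 2 bytes
os.unlink(f.name)
```

This mirrors the thread's arithmetic: a 2^31-byte file plus a 2147483650-byte fallocate range should cost 2 new bytes, not another 2 GiB.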
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 20:15, Holger Hoffstätte wrote:
> On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
> [..]
>> Apparently, if you call fallocate() on a file with an offset of 0 and
>> a length longer than the length of the file itself, BTRFS will
>> allocate that exact amount of space, instead of just filling in holes
>> in the file and allocating space to extend it. If there isn't enough
>> space on the filesystem for this, then it will fail, even though it
>> would succeed on ext4, XFS, and F2FS.
> [..]
>> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
>> behavior that should be considered implementation defined? 2. If not,
>> is my assessment that BTRFS is behaving incorrectly in this case
>> accurate?
>
> IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3)
> make it very clear that the expected default behaviour is to extend.
> I don't think this can be interpreted in any other way than incorrect
> behaviour on behalf of btrfs.
>
> Your script reproduces for me, so that's a start.

Your reproducer should never ENOSPC because it requires exactly 0 new bytes to be allocated, yet it also fails with --keep-size.

From a quick look it seems that btrfs_fallocate() unconditionally calls btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to lazily allocate the necessary extent(s), which goes ENOSPC because that size is again the full size of the requested range, not the difference between the existing file size and the new range length. But I might be misreading things..

-h
Re: Raid0 rescue
On Tue, Aug 1, 2017 at 12:36 PM, Alan Brand wrote:
> I successfully repaired the superblock, copied it from one of the backups.
> My biggest problem now is that the UUID for the disk has changed due
> to the reformatting and no longer matches what is in the metadata.
> I need to make linux recognize the partition as btrfs and have the correct
> UUID.
> Any suggestions?

Huh, insofar as I'm aware, Btrfs does not track a "disk" UUID or partition UUID. A better qualified set of steps for fixing this would be:

a.) restore partitioning, if any
b.) wipefs the NTFS signature to invalidate the NTFS file system
c.) use super-recover to replace correct supers on both drives
d.) mount the file system
e.) do a full scrub

The last step is optional but best practice. It'll actively do fixups, and you'll get an error message with the path to files that are not recoverable. Alternatively a metadata-only balance will do fixups, and it'll be much faster. But you won't get info right away about what files are damaged.

-- 
Chris Murphy
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
>
> On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
>
> it would be possible to have more information ?
> - what is logged ? data, parity or data + parity ?

Patch 5 has more details (sorry for not making that clear in the cover letter). Both data and parity are logged, so that while replaying the journal everything is written to whichever disk it should be written to.

> - in the past I thought that it would be sufficient to log only the stripe
>   position involved by a RMW cycle, and then start a scrub on these stripes
>   in case of an unclean shutdown: do you think that it is feasible ?

An unclean shutdown causes inconsistency between data and parity, so scrub won't help, as it's not able to tell which one (data or parity) is valid. With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe. With datacow, we don't do overwrite, but the following situation may happen. Say we have a raid5 setup with 3 disks and a stripe length of 64K, so:

1) write 64K --> now the raid layout is [64K data + 64K random + 64K parity]
2) write another 64K --> now the raid layout after RMW is [64K 1)'s data + 64K 2)'s data + 64K new parity]

If an unclean shutdown occurs before 2) finishes, then the parity may be corrupted, and 1)'s data may then be recovered wrongly if the disk which holds 1)'s data is offline.

> - does this journal disk also host other btrfs log ?

No, purely data/parity and some associated metadata.

Thanks,

-liubo

> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl(partial parity log) is also feasible to implement.
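The write-hole scenario Liu Bo describes can be sketched in a few lines (a toy 3-device RAID5 model with invented variables, not the btrfs code):

```python
# Toy 3-device RAID5: d0 and d1 are data strips, p = d0 XOR d1.
# Model an unclean shutdown where the new data strip reaches disk but
# the matching parity update is lost: the classic "write hole".

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"\xaa" * 4
d1 = b"\x55" * 4
p = xor(d0, d1)                  # parity consistent with d0, d1

d1 = b"\xff" * 4                 # data write of step 2) completes...
# ...crash before parity is rewritten: p still reflects the OLD d1.

# Later, the disk holding d0 goes offline; reconstruct d0 from d1 and
# the stale parity:
recovered_d0 = xor(d1, p)
print(recovered_d0 == b"\xaa" * 4)   # False -- d0 is recovered wrongly
```

With a journal, data and parity are replayed together after the crash, so `p` always matches the data strips before any reconstruction is attempted.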
Re: Raid0 rescue
On Thu, Jul 27, 2017 at 8:49 AM, Alan Brand wrote:
> I know I am screwed but hope someone here can point at a possible solution.
>
> I had a pair of btrfs drives in a raid0 configuration. One of the
> drives was pulled by mistake, put in a windows box, and a quick NTFS
> format was done. Then much screaming occurred.
>
> I know the data is still there. Is there anyway to rebuild the raid
> bringing in the bad disk? I know some info is still good, for example
> metadata0 is corrupt but 1 and 2 are good.
> The trees look bad which is probably the killer.

Well, the first step is to check and fix the super blocks. And then the normal code should just discover the bad stuff, get good copies from the good drive, and copy them to the corrupt one, passively, and eventually fix the file system itself. There's probably only a few files corrupted irrecoverably.

It's probably worth testing for this explicitly. It's not a wild scenario, and it's something Btrfs should be able to recover from gracefully. The gotcha part of a totally automatic recovery is the superblocks, because there's no *one true right way* for the kernel to just assume the remaining Btrfs supers are more valid than the NTFS supers.

So then the question is, which tool should fix this up? I'd say both 'btrfs rescue super-recover' and 'btrfs check' should do this. The difference being super-recover would fix only the supers, with kernel code doing passive fixups as problems are encountered once the fs is mounted. And 'check --repair' would fix supers and additionally fix missing metadata on the corrupt drive, using user space code with an unmounted system. Both should work, or at least both should be fail safe.

-- 
Chris Murphy
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 10:56:39AM -0600, Liu Bo wrote: > On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote: > >Hi, > > > >Great to see something addressing the write hole at last. > > > > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > > separate disk as a journal (aka raid5/6 log), so that after unclean > > > shutdown we can make sure data and parity are consistent on the raid > > > array by replaying the journal. > > > >What's the behaviour of the FS if the log device dies during use? > > > > Error handling on IOs is still under construction (belongs to known > limitations). > > If the log device dies suddenly, I think we could skip the writeback > to backend raid arrays and follow the rule in btrfs, filp FS to > readonly as it may expose data loss. What do you think? I think the key thing for me is that the overall behaviour of the redundancy in the FS is not compromised by the logging solution. That is, the same guarantees still hold: For RAID-5, you can lose up to one device of the FS (*including* any log devices), and the FS will continue to operate normally, but degraded. For RAID-6, you can lose up to two devices without losing any capabilities of the FS. Dropping to read-only if the (single) log device fails would break those guarantees. I quite like the idea of embedding the log chunks into the allocated structure of the FS -- although as pointed out, this is probably going to need a new chunk type, and (to retain the guarantees of the RAID-6 behaviour above) the ability to do 3-way RAID-1 on those chunks. You'd also have to be able to balance the log structures while in flight. It sounds like a lot more work for you, though. Hmm... if 3-way RAID-1 (3c) is available, then you could also have RAID-1*3 on metadata, RAID-6 on data, and have 2-device redundancy throughout. That's also a very attractive configuration in many respects. 
(Analogous to RAID-1 metadata and RAID-5 data).

Hugo.

-- 
Hugo Mills             | That's not rain, that's a lake with slots in it.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 19:34, Austin S. Hemmelgarn wrote:
[..]
> Apparently, if you call fallocate() on a file with an offset of 0 and
> a length longer than the length of the file itself, BTRFS will
> allocate that exact amount of space, instead of just filling in holes
> in the file and allocating space to extend it. If there isn't enough
> space on the filesystem for this, then it will fail, even though it
> would succeed on ext4, XFS, and F2FS.
[..]
> I'm curious to hear anybody's thoughts on this, namely: 1. Is this
> behavior that should be considered implementation defined? 2. If not,
> is my assessment that BTRFS is behaving incorrectly in this case
> accurate?

IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3) make it very clear that the expected default behaviour is to extend. I don't think this can be interpreted in any other way than incorrect behaviour on behalf of btrfs.

Your script reproduces for me, so that's a start.

-h
Re: Btrfs + compression = slow performance and high cpu usage
----- Original Message -----
From: "Peter Grandi"
To: "Linux fs Btrfs"
Sent: Tuesday, 1 August, 2017 3:14:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As I wrote, "their size is just limited by the compression code" which results in "128KiB writes". On a "fresh empty Btrfs volume" the compressed extents limited to 128KiB also happen to be pretty physically contiguous, but on a more fragmented free space list they can be more scattered.

KOS: Ok, thanks for pointing it out. I have compared the filefrag -v on another btrfs that is not fragmented and see the difference with what is happening on the sluggish one.

5824: 186368.. 186399: 2430093383..2430093414: 32: 2430093414: encoded
5825: 186400.. 186431: 2430093384..2430093415: 32: 2430093415: encoded
5826: 186432.. 186463: 2430093385..2430093416: 32: 2430093416: encoded
5827: 186464.. 186495: 2430093386..2430093417: 32: 2430093417: encoded
5828: 186496.. 186527: 2430093387..2430093418: 32: 2430093418: encoded
5829: 186528.. 186559: 2430093388..2430093419: 32: 2430093419: encoded
5830: 186560.. 186591: 2430093389..2430093420: 32: 2430093420: encoded

As I already wrote, the main issue here seems to be that we are talking about a "RAID5 with 128KiB writes and a 768KiB stripe size". On MD RAID5 the slowdown because of RMW seems only to be around 30-40%, but it looks like several back-to-back 128KiB writes get merged by the Linux IO subsystem (not sure whether that's thoroughly legal), and perhaps they get merged by the 3ware firmware only if it has a persistent cache, and maybe your 3ware does not have one, but you have kept your counsel as to that.

KOS: No I don't have persistent cache. Only the 512 Mb cache on board of the controller, that is BBU. If I had additional SSD caching on the controller I would have mentioned it.
I was also under the impression that in a situation where mostly extra large files will be stored on the array, a bigger strip size would indeed increase the speed, thus I went with the 256 Kb strip size.

Would I be correct in assuming that a RAID strip size of 128 Kb will be a better choice if one plans to use BTRFS with compression?

thanks,
kos
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 01:39:59PM -0400, Austin S. Hemmelgarn wrote: > On 2017-08-01 13:25, Roman Mamedov wrote: > > On Tue, 1 Aug 2017 10:14:23 -0600 > > Liu Bowrote: > > > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > > separate disk as a journal (aka raid5/6 log), so that after unclean > > > shutdown we can make sure data and parity are consistent on the raid > > > array by replaying the journal. > > > > Could it be possible to designate areas on the in-array devices to be used > > as > > journal? > > > > While md doesn't have much spare room in its metadata for extraneous things > > like this, Btrfs could use almost as much as it wants to, adding to size of > > the > > FS metadata areas. Reliability-wise, the log could be stored as RAID1 > > chunks. > > > > It doesn't seem convenient to need having an additional storage device > > around > > just for the log, and also needing to maintain its fault tolerance yourself > > (so > > the log device would better be on a mirror, such as mdadm RAID1? more > > expense > > and maintenance complexity). > > > I agree, MD pretty much needs a separate device simply because they can't > allocate arbitrary space on the other array members. BTRFS can do that > though, and I would actually think that that would be _easier_ to implement > than having a separate device. > Yes and no, using chunks may need a new ioctl and diving into chunk allocation/(auto)deletion maze. > That said, I do think that it would need to be a separate chunk type, > because things could get really complicated if the metadata is itself using > a parity raid profile. Exactly, esp. when balance comes into the picture. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What are the typical usecase of "btrfs check --init-extent-tree"?
On Thu, Jul 27, 2017 at 9:33 AM, Ivan Sizov wrote:
> I've just noticed a huge number of errors on one of the RAID's disks.
> "btrfs dev stats" gives:
>
> [/dev/sdc1].write_io_errs    0
> [/dev/sdc1].read_io_errs     305
> [/dev/sdc1].flush_io_errs    0
> [/dev/sdc1].corruption_errs  429
> [/dev/sdc1].generation_errs  0
>
> [/dev/sda1].write_io_errs    58331
> [/dev/sda1].read_io_errs     57438
> [/dev/sda1].flush_io_errs    37
> [/dev/sda1].corruption_errs  10110
> [/dev/sda1].generation_errs  0

You'll need to translate the sda device to an ata device, and then do a search for kernel messages. I suspect a persistent bad sector on this drive, and write failures are always disqualifying, but Btrfs won't eject this device.

Read errors are not a big problem. Read errors along with corruptions aren't necessarily a big problem. Write errors are a big problem.

-- 
Chris Murphy
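For the "translate the sda device to an ata device" step, the mapping can be read from the sysfs device path; a sketch that extracts the ataN component (the sample path below is illustrative, in practice you would resolve os.path.realpath("/sys/block/sda") and then grep dmesg for that port):

```python
import re

def ata_port(sysfs_path):
    """Extract the ataN component from a resolved /sys/block/sdX path."""
    m = re.search(r"/(ata\d+)/", sysfs_path)
    return m.group(1) if m else None

# Illustrative path of the kind os.path.realpath("/sys/block/sda") returns:
sample = ("/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/"
          "target0:0:0/0:0:0:0/block/sda")
print(ata_port(sample))   # ata1
# Then search the kernel log for lines mentioning "ata1" to find the
# media/write error reports behind the dev-stats counters.
```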
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 10:25:47PM +0500, Roman Mamedov wrote: > On Tue, 1 Aug 2017 10:14:23 -0600 > Liu Bowrote: > > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > separate disk as a journal (aka raid5/6 log), so that after unclean > > shutdown we can make sure data and parity are consistent on the raid > > array by replaying the journal. > > Could it be possible to designate areas on the in-array devices to be used as > journal? > > While md doesn't have much spare room in its metadata for extraneous things > like this, Btrfs could use almost as much as it wants to, adding to size of > the > FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. > Yes, it makes sense, we could definitely do that, that was actually the original idea. I started with adding a new device for log as it looks easier to me, but I could try that now. > It doesn't seem convenient to need having an additional storage device around > just for the log, and also needing to maintain its fault tolerance yourself > (so > the log device would better be on a mirror, such as mdadm RAID1? more expense > and maintenance complexity). > That's true. Thanks for the suggestions. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, Aug 01, 2017 at 05:28:57PM +, Hugo Mills wrote: >Hi, > >Great to see something addressing the write hole at last. > > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > > separate disk as a journal (aka raid5/6 log), so that after unclean > > shutdown we can make sure data and parity are consistent on the raid > > array by replaying the journal. > >What's the behaviour of the FS if the log device dies during use? > Error handling on IOs is still under construction (belongs to known limitations). If the log device dies suddenly, I think we could skip the writeback to backend raid arrays and follow the rule in btrfs, filp FS to readonly as it may expose data loss. What do you think? Thanks, -liubo >Hugo. > > > The idea and the code are similar to the write-through mode of md > > raid5-cache, so ppl(partial parity log) is also feasible to implement. > > (If you've been familiar with md, you may find this patch set is > > boring to read...) > > > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > > the implementation, the rest patches are improvements and bugfixes, > > eg. readahead for recovery, checksum. > > > > Two btrfs-progs patches are required to play with this patch set, one > > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > > option '-L', the other is to teach 'btrfs-show-super' to show > > %journal_tail. > > > > This is currently based on 4.12-rc3. > > > > The patch set is tagged with RFC, and comments are always welcome, > > thanks. > > > > Known limitations: > > - Deleting a log device is not implemented yet. 
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi Liu, On 2017-08-01 18:14, Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. > Would it be possible to have more information? - what is logged? data, parity, or data + parity? - in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on these stripes in case of an unclean shutdown: do you think that is feasible? - does this journal disk also host other btrfs logs? > The idea and the code are similar to the write-through mode of md > raid5-cache, so ppl(partial parity log) is also feasible to implement. > (If you've been familiar with md, you may find this patch set is > boring to read...) > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > the implementation, the rest patches are improvements and bugfixes, > eg. readahead for recovery, checksum. > > Two btrfs-progs patches are required to play with this patch set, one > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > option '-L', the other is to teach 'btrfs-show-super' to show > %journal_tail. > > This is currently based on 4.12-rc3. > > The patch set is tagged with RFC, and comments are always welcome, > thanks. > > Known limitations: > - Deleting a log device is not implemented yet.
> > > Liu Bo (14): > Btrfs: raid56: add raid56 log via add_dev v2 ioctl > Btrfs: raid56: do not allocate chunk on raid56 log > Btrfs: raid56: detect raid56 log on mount > Btrfs: raid56: add verbose debug > Btrfs: raid56: add stripe log for raid5/6 > Btrfs: raid56: add reclaim support > Btrfs: raid56: load r5log > Btrfs: raid56: log recovery > Btrfs: raid56: add readahead for recovery > Btrfs: raid56: use the readahead helper to get page > Btrfs: raid56: add csum support > Btrfs: raid56: fix error handling while adding a log device > Btrfs: raid56: initialize raid5/6 log after adding it > Btrfs: raid56: maintain IO order on raid5/6 log > > fs/btrfs/ctree.h| 16 +- > fs/btrfs/disk-io.c | 16 + > fs/btrfs/ioctl.c| 48 +- > fs/btrfs/raid56.c | 1429 > ++- > fs/btrfs/raid56.h | 82 +++ > fs/btrfs/transaction.c |2 + > fs/btrfs/volumes.c | 56 +- > fs/btrfs/volumes.h |7 +- > include/uapi/linux/btrfs.h |3 + > include/uapi/linux/btrfs_tree.h |4 + > 10 files changed, 1487 insertions(+), 176 deletions(-) > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 2017-08-01 13:25, Roman Mamedov wrote: On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bo wrote: This aims to fix write hole issue on btrfs raid5/6 setup by adding a separate disk as a journal (aka raid5/6 log), so that after unclean shutdown we can make sure data and parity are consistent on the raid array by replaying the journal. Could it be possible to designate areas on the in-array devices to be used as journal? While md doesn't have much spare room in its metadata for extraneous things like this, Btrfs could use almost as much as it wants to, adding to size of the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. It doesn't seem convenient to need having an additional storage device around just for the log, and also needing to maintain its fault tolerance yourself (so the log device would better be on a mirror, such as mdadm RAID1? more expense and maintenance complexity). I agree, MD pretty much needs a separate device simply because they can't allocate arbitrary space on the other array members. BTRFS can do that though, and I would actually think that that would be _easier_ to implement than having a separate device. That said, I do think that it would need to be a separate chunk type, because things could get really complicated if the metadata is itself using a parity raid profile.
Odd fallocate behavior on BTRFS.
A recent thread on the BTRFS mailing list [1] brought up some odd behavior in BTRFS that I've long suspected but not had prior reason to test. I've put the fsdevel mailing list on CC since I'm curious to hear what people there think about this. Apparently, if you call fallocate() on a file with an offset of 0 and a length longer than the length of the file itself, BTRFS will allocate that exact amount of space, instead of just filling in holes in the file and allocating space to extend it. If there isn't enough space on the filesystem for this, then it will fail, even though it would succeed on ext4, XFS, and F2FS. The following script demonstrates this: #!/bin/bash touch ./test-fs truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" umount ./test rmdir ./test rm -f ./test-fs This will spit out a -ENOSPC error from the fallocate call, but if you change the mkfs call to ext4, XFS, or F2FS, it will instead succeed without error. If the fallocate call is changed to `fallocate -o 2147483648 -l 2 ./test/test`, it will succeed on all filesystems. I have not yet done any testing to determine if this also applies for offsets other than 0, but I suspect it does (it would be kind of odd if it didn't). My thought on this is that the behavior that BTRFS exhibits is incorrect in this case, at a minimum because it does not follow the apparent de-facto standard, and because it keeps some things from working (the OP in the thread that resulted in me finding this was having issues trying to extend a SnapRAID parity file that was already larger than half the size of the BTRFS volume it was stored on). I'm curious to hear anybody's thoughts on this, namely: 1. Is this behavior that should be considered implementation defined? 2. If not, is my assessment that BTRFS is behaving incorrectly in this case accurate? 
[1] https://marc.info/?l=linux-btrfs=150158963921123=2
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
Hi, Great to see something addressing the write hole at last. On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. What's the behaviour of the FS if the log device dies during use? Hugo. > The idea and the code are similar to the write-through mode of md > raid5-cache, so ppl(partial parity log) is also feasible to implement. > (If you've been familiar with md, you may find this patch set is > boring to read...) > > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of > the implementation, the rest patches are improvements and bugfixes, > eg. readahead for recovery, checksum. > > Two btrfs-progs patches are required to play with this patch set, one > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the > option '-L', the other is to teach 'btrfs-show-super' to show > %journal_tail. > > This is currently based on 4.12-rc3. > > The patch set is tagged with RFC, and comments are always welcome, > thanks. > > Known limitations: > - Deleting a log device is not implemented yet. 
> > > Liu Bo (14): > Btrfs: raid56: add raid56 log via add_dev v2 ioctl > Btrfs: raid56: do not allocate chunk on raid56 log > Btrfs: raid56: detect raid56 log on mount > Btrfs: raid56: add verbose debug > Btrfs: raid56: add stripe log for raid5/6 > Btrfs: raid56: add reclaim support > Btrfs: raid56: load r5log > Btrfs: raid56: log recovery > Btrfs: raid56: add readahead for recovery > Btrfs: raid56: use the readahead helper to get page > Btrfs: raid56: add csum support > Btrfs: raid56: fix error handling while adding a log device > Btrfs: raid56: initialize raid5/6 log after adding it > Btrfs: raid56: maintain IO order on raid5/6 log > > fs/btrfs/ctree.h| 16 +- > fs/btrfs/disk-io.c | 16 + > fs/btrfs/ioctl.c| 48 +- > fs/btrfs/raid56.c | 1429 > ++- > fs/btrfs/raid56.h | 82 +++ > fs/btrfs/transaction.c |2 + > fs/btrfs/volumes.c | 56 +- > fs/btrfs/volumes.h |7 +- > include/uapi/linux/btrfs.h |3 + > include/uapi/linux/btrfs_tree.h |4 + > 10 files changed, 1487 insertions(+), 176 deletions(-) > -- Hugo Mills | Some days, it's just not worth gnawing through the hugo@... carfax.org.uk | straps http://carfax.org.uk/ | PGP: E2AB1DE4 |
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bo wrote: > This aims to fix write hole issue on btrfs raid5/6 setup by adding a > separate disk as a journal (aka raid5/6 log), so that after unclean > shutdown we can make sure data and parity are consistent on the raid > array by replaying the journal. Could it be possible to designate areas on the in-array devices to be used as journal? While md doesn't have much spare room in its metadata for extraneous things like this, Btrfs could use almost as much as it wants to, adding to size of the FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks. It doesn't seem convenient to need having an additional storage device around just for the log, and also needing to maintain its fault tolerance yourself (so the log device would better be on a mirror, such as mdadm RAID1? more expense and maintenance complexity). -- With respect, Roman
[PATCH 12/14] Btrfs: raid56: fix error handling while adding a log device
Currently there is a memory leak if we hit an error while adding a raid5/6 log. Moreover, it didn't abort the transaction as other error paths do. This fixes the broken error handling by splitting log initialization into two steps: step #1 allocates memory and checks that the device has a proper size, and step #2 assigns the pointer in %fs_info. By running step #1 ahead of starting the transaction, we can now gracefully bail out on errors. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 48 +--- fs/btrfs/raid56.h | 5 + fs/btrfs/volumes.c | 36 ++-- 3 files changed, 68 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 8bc7ba4..0bfc97a 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -3711,30 +3711,64 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio) async_missing_raid56(rbio); } -int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) +struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, struct btrfs_device *device, struct block_device *bdev) { - struct btrfs_r5l_log *log; - - log = kzalloc(sizeof(*log), GFP_NOFS); + int num_devices = fs_info->fs_devices->num_devices; + u64 dev_total_bytes; + struct btrfs_r5l_log *log = kzalloc(sizeof(struct btrfs_r5l_log), GFP_NOFS); if (!log) - return -ENOMEM; + return ERR_PTR(-ENOMEM); + + ASSERT(device); + ASSERT(bdev); + dev_total_bytes = i_size_read(bdev->bd_inode); /* see find_free_dev_extent for 1M start offset */ log->data_offset = 1024ull * 1024; - log->device_size = btrfs_device_get_total_bytes(device) - log->data_offset; + log->device_size = dev_total_bytes - log->data_offset; log->device_size = round_down(log->device_size, PAGE_SIZE); + + /* +* when device has been included in fs_devices, do not take +* into account this device when checking log size.
+*/ + if (device->in_fs_metadata) + num_devices--; + + if (log->device_size < BTRFS_STRIPE_LEN * num_devices * 2) { + btrfs_info(fs_info, "r5log log device size (%llu < %llu) is too small", log->device_size, BTRFS_STRIPE_LEN * num_devices * 2); + kfree(log); + return ERR_PTR(-EINVAL); + } + log->dev = device; log->fs_info = fs_info; ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE); log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid)); mutex_init(&log->io_mutex); + return log; +} + +void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *log) +{ cmpxchg(&fs_info->r5log, NULL, log); ASSERT(fs_info->r5log == log); #ifdef BTRFS_DEBUG_R5LOG - trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx", + trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx\n", log->data_offset, log->data_offset + log->device_size); #endif +} + +int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) +{ + struct btrfs_r5l_log *log; + + log = btrfs_r5l_init_log_prepare(fs_info, device, device->bdev); + if (IS_ERR(log)) + return PTR_ERR(log); + + btrfs_r5l_init_log_post(fs_info, log); return 0; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index 569cec8..f6d6f36 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -134,6 +134,11 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio); int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info); void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info); +struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, + struct btrfs_device *device, + struct block_device *bdev); +void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, +struct btrfs_r5l_log *log); int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device); int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp); #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ac64d93..851c001 100644 ---
a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2327,6 +2327,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path int seeding_dev = 0; int ret = 0; bool is_r5log = (flags & BTRFS_DEVICE_RAID56_LOG); + struct btrfs_r5l_log *r5log = NULL; if (is_r5log) ASSERT(!fs_info->fs_devices->seeding); @@ -2367,6 +2368,15 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path goto error; } +
[PATCH 10/14] Btrfs: raid56: use the readahead helper to get page
This updates recovery code to use the readahead helper. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 24f7cbb..8f47e56 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1608,7 +1608,9 @@ static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_recover_ctx *ctx) { struct btrfs_r5l_meta_block *mb; - btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, ctx->meta_page, REQ_OP_READ); + ret = btrfs_r5l_recover_read_page(ctx, ctx->meta_page, ctx->pos); + if (ret) + return ret; mb = kmap(ctx->meta_page); #ifdef BTRFS_DEBUG_R5LOG -- 2.9.4
[PATCH 06/14] Btrfs: raid56: add reclaim support
The log space is limited, so reclaim is necessary when there is not enough space to use. By recording the largest position we've written to the log disk and flushing all disks' cache and the superblock, we can be sure that data and parity before this position have the identical copy in the log and raid5/6 array. Also we need to take care of the case when IOs get reordered. A list is used to keep the order right. Signed-off-by: Liu Bo--- fs/btrfs/ctree.h | 10 +++- fs/btrfs/raid56.c | 63 -- fs/btrfs/transaction.c | 2 ++ 3 files changed, 72 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d967627..9235643 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -244,8 +244,10 @@ struct btrfs_super_block { __le64 cache_generation; __le64 uuid_tree_generation; + /* r5log journal tail (where recovery starts) */ + __le64 journal_tail; /* future expansion */ - __le64 reserved[30]; + __le64 reserved[29]; u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE]; struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS]; } __attribute__ ((__packed__)); @@ -2291,6 +2293,8 @@ BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); +BTRFS_SETGET_STACK_FUNCS(super_journal_tail, struct btrfs_super_block, +journal_tail, 64); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, total_bytes, 64); BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block, @@ -3284,6 +3288,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, unsigned long new_flags); int btrfs_sync_fs(struct super_block *sb, int wait); +/* raid56.c */ +void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info); + + static inline __printf(2, 3) void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...) 
{ diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 007ba63..60010a6 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -191,6 +191,8 @@ struct btrfs_r5l_log { u64 data_offset; u64 device_size; + u64 next_checkpoint; + u64 last_checkpoint; u64 last_cp_seq; u64 seq; @@ -1231,11 +1233,14 @@ static void btrfs_r5l_log_endio(struct bio *bio) bio_put(bio); #ifdef BTRFS_DEBUG_R5LOG - trace_printk("move data to disk\n"); + trace_printk("move data to disk(current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); #endif /* move data to RAID. */ btrfs_write_rbio(io->rbio); + /* After stripe data has been flushed into raid, set ->next_checkpoint. */ + log->next_checkpoint = io->log_start; + if (log->current_io == io) log->current_io = NULL; btrfs_r5l_free_io_unit(log, io); @@ -1473,6 +1478,42 @@ static bool btrfs_r5l_has_free_space(struct btrfs_r5l_log *log, u64 size) } /* + * writing super with log->next_checkpoint + * + * This is protected by log->io_mutex. + */ +static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp) +{ + int ret; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5l writing super to reclaim space, cp %llu\n", cp); +#endif + + btrfs_set_super_journal_tail(fs_info->super_for_commit, cp); + + /* +* flush all disk cache so that all data prior to +* %next_checkpoint lands on raid disks(recovery will start +* from %next_checkpoint). +*/ + ret = write_all_supers(fs_info, 1); + ASSERT(ret == 0); +} + +/* this is called by commit transaction and it's followed by writing super. */ +void btrfs_r5l_write_journal_tail(struct btrfs_fs_info *fs_info) +{ + if (fs_info->r5log) { + u64 cp = READ_ONCE(fs_info->r5log->next_checkpoint); + + trace_printk("journal_tail %llu\n", cp); + btrfs_set_super_journal_tail(fs_info->super_copy, cp); + WRITE_ONCE(fs_info->r5log->last_checkpoint, cp); + } +} + +/* * return 0 if data/parity are written into log and it will move data * to RAID in endio. 
* @@ -1535,7 +1576,25 @@ static int btrfs_r5l_write_stripe(struct btrfs_raid_bio *rbio) btrfs_r5l_log_stripe(log, data_pages, parity_pages, rbio); do_submit = true; } else { - ; /* XXX: reclaim */ +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5log: no space log->last_checkpoint %llu log->log_start %llu log->next_checkpoint %llu\n", log->last_checkpoint, log->log_start, log->next_checkpoint); +#endif + + /* +*
[PATCH 01/14] Btrfs: raid56: add raid56 log via add_dev v2 ioctl
This introduces add_dev_v2 ioctl to add a device as raid56 journal device. With the help of a journal device, raid56 is able to get rid of potential write holes. Signed-off-by: Liu Bo--- fs/btrfs/ctree.h| 6 ++ fs/btrfs/ioctl.c| 48 - fs/btrfs/raid56.c | 42 fs/btrfs/raid56.h | 1 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 3 ++- include/uapi/linux/btrfs.h | 3 +++ include/uapi/linux/btrfs_tree.h | 4 8 files changed, 125 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 643c70d..d967627 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -697,6 +697,7 @@ struct btrfs_stripe_hash_table { void btrfs_init_async_reclaim_work(struct work_struct *work); /* fs_info */ +struct btrfs_r5l_log; struct reloc_control; struct btrfs_device; struct btrfs_fs_devices; @@ -1114,6 +1115,9 @@ struct btrfs_fs_info { u32 nodesize; u32 sectorsize; u32 stripesize; + + /* raid56 log */ + struct btrfs_r5l_log *r5log; }; static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) @@ -2932,6 +2936,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info) static inline void free_fs_info(struct btrfs_fs_info *fs_info) { + if (fs_info->r5log) + kfree(fs_info->r5log); kfree(fs_info->balance_ctl); kfree(fs_info->delayed_root); kfree(fs_info->extent_root); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index e176375..3d1ef4d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2653,6 +2653,50 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) return ret; } +/* identical to btrfs_ioctl_add_dev, but this is with flags */ +static long btrfs_ioctl_add_dev_v2(struct btrfs_fs_info *fs_info, void __user *arg) +{ + struct btrfs_ioctl_vol_args_v2 *vol_args; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) + return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; + + mutex_lock(&fs_info->volume_mutex); + vol_args = memdup_user(arg, sizeof(*vol_args)); + if
(IS_ERR(vol_args)) { + ret = PTR_ERR(vol_args); + goto out; + } + + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG && + fs_info->r5log) { + ret = -EEXIST; + btrfs_info(fs_info, "r5log: attempting to add another log device!"); + goto out_free; + } + + vol_args->name[BTRFS_PATH_NAME_MAX] = '\0'; + ret = btrfs_init_new_device(fs_info, vol_args->name, vol_args->flags); + if (!ret) { + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG) { + ASSERT(fs_info->r5log); + btrfs_info(fs_info, "disk added %s as raid56 log", vol_args->name); + } else { + btrfs_info(fs_info, "disk added %s", vol_args->name); + } + } +out_free: + kfree(vol_args); +out: + mutex_unlock(&fs_info->volume_mutex); + clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags); + return ret; +} + static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) { struct btrfs_ioctl_vol_args *vol_args; @@ -2672,7 +2716,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) } vol_args->name[BTRFS_PATH_NAME_MAX] = '\0'; - ret = btrfs_init_new_device(fs_info, vol_args->name); + ret = btrfs_init_new_device(fs_info, vol_args->name, 0); if (!ret) btrfs_info(fs_info, "disk added %s", vol_args->name); @@ -5539,6 +5583,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_resize(file, argp); case BTRFS_IOC_ADD_DEV: return btrfs_ioctl_add_dev(fs_info, argp); + case BTRFS_IOC_ADD_DEV_V2: + return btrfs_ioctl_add_dev_v2(fs_info, argp); case BTRFS_IOC_RM_DEV: return btrfs_ioctl_rm_dev(file, argp); case BTRFS_IOC_RM_DEV_V2: diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index d8ea0eb..2b91b95 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -177,6 +177,25 @@ struct btrfs_raid_bio { unsigned long *dbitmap; }; +/* raid56 log */ +struct btrfs_r5l_log { + /* protect this struct and log io */ + struct mutex io_mutex; + + /* r5log device */ + struct btrfs_device *dev; + + /* allocation range for log entries */ + u64 data_offset; + u64 device_size; + + u64 last_checkpoint; +
u64 last_cp_seq; + u64 seq; + u64 log_start; + struct btrfs_r5l_io_unit *current_io;
[PATCH 09/14] Btrfs: raid56: add readahead for recovery
While doing recovery, blocks are read from the raid5/6 disk one by one, so this is adding readahead so that we can read at most 256 contiguous blocks in one read IO. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 114 +++--- 1 file changed, 109 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index dea33c4..24f7cbb 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1530,15 +1530,81 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, return ret; } +#define BTRFS_R5L_RECOVER_IO_POOL_SIZE BIO_MAX_PAGES struct btrfs_r5l_recover_ctx { u64 pos; u64 seq; u64 total_size; struct page *meta_page; struct page *io_page; + + struct page *ra_pages[BTRFS_R5L_RECOVER_IO_POOL_SIZE]; + struct bio *ra_bio; + int total; + int valid; + u64 start_offset; + + struct btrfs_r5l_log *log; }; -static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +static int btrfs_r5l_recover_read_ra(struct btrfs_r5l_recover_ctx *ctx, u64 offset) +{ + bio_reset(ctx->ra_bio); + ctx->ra_bio->bi_bdev = ctx->log->dev->bdev; + ctx->ra_bio->bi_opf = REQ_OP_READ; + ctx->ra_bio->bi_iter.bi_sector = (ctx->log->data_offset + offset) >> 9; + + ctx->valid = 0; + ctx->start_offset = offset; + + while (ctx->valid < ctx->total) { + bio_add_page(ctx->ra_bio, ctx->ra_pages[ctx->valid++], PAGE_SIZE, 0); + + offset = btrfs_r5l_ring_add(ctx->log, offset, PAGE_SIZE); + if (offset == 0) + break; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("to read %d pages starting from 0x%llx\n", ctx->valid, ctx->log->data_offset + ctx->start_offset); +#endif + return submit_bio_wait(ctx->ra_bio); +} + +static int btrfs_r5l_recover_read_page(struct btrfs_r5l_recover_ctx *ctx, struct page *page, u64 offset) +{ + struct page *tmp; + int index; + char *src; + char *dst; + int ret; + + if (offset < ctx->start_offset || offset >= (ctx->start_offset + ctx->valid * PAGE_SIZE)) { + ret = btrfs_r5l_recover_read_ra(ctx, offset); + if (ret) 
+ return ret; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("offset 0x%llx start->offset 0x%llx ctx->valid %d\n", offset, ctx->start_offset, ctx->valid); +#endif + + ASSERT(IS_ALIGNED(ctx->start_offset, PAGE_SIZE)); + ASSERT(IS_ALIGNED(offset, PAGE_SIZE)); + + index = (offset - ctx->start_offset) >> PAGE_SHIFT; + ASSERT(index < ctx->valid); + + tmp = ctx->ra_pages[index]; + src = kmap(tmp); + dst = kmap(page); + memcpy(dst, src, PAGE_SIZE); + kunmap(page); + kunmap(tmp); + return 0; +} + +static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_recover_ctx *ctx) { struct btrfs_r5l_meta_block *mb; @@ -1642,6 +1708,42 @@ static int btrfs_r5l_recover_flush_log(struct btrfs_r5l_log *log, struct btrfs_r } return ret; + +static int btrfs_r5l_recover_allocate_ra(struct btrfs_r5l_recover_ctx *ctx) +{ + struct page *page; + ctx->ra_bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES); + + ctx->total = 0; + ctx->valid = 0; + while (ctx->total < BTRFS_R5L_RECOVER_IO_POOL_SIZE) { + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!page) + break; + + ctx->ra_pages[ctx->total++] = page; + } + + if (ctx->total == 0) { + bio_put(ctx->ra_bio); + return -ENOMEM; + } + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("readahead: %d allocated pages\n", ctx->total); +#endif + return 0; +} + +static void btrfs_r5l_recover_free_ra(struct btrfs_r5l_recover_ctx *ctx) +{ + int i; +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("readahead: %d to free pages\n", ctx->total); +#endif + for (i = 0; i < ctx->total; i++) + __free_page(ctx->ra_pages[i]); + bio_put(ctx->ra_bio); } static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp); @@ -1655,6 +1757,7 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) ctx = kzalloc(sizeof(*ctx), GFP_NOFS); ASSERT(ctx); + ctx->log = log; ctx->pos = log->last_checkpoint; ctx->seq = log->last_cp_seq; ctx->meta_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); @@ -1662,10 +1765,10 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) ctx->io_page = 
alloc_page(GFP_NOFS | __GFP_HIGHMEM); ASSERT(ctx->io_page); - ret = btrfs_r5l_recover_flush_log(log, ctx); - if (ret) { - ; - } + ret = btrfs_r5l_recover_allocate_ra(ctx); + ASSERT(ret == 0); + +
[PATCH 14/14] Btrfs: raid56: maintain IO order on raid5/6 log
A typical write to the raid5/6 log needs three steps: 1) collect data/parity pages into the bio in io_unit; 2) submit the bio in io_unit; 3) writeback data/parity to raid array in end_io. 1) and 2) are protected within log->io_mutex, while 3) is not. Since recovery needs to know the checkpoint offset where the highest successful writeback is, we cannot allow IO to be reordered. This is adding a list in which IO order is maintained properly. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 42 ++ fs/btrfs/raid56.h | 5 + 2 files changed, 39 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index b771d7d..ceca415 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -183,6 +183,9 @@ struct btrfs_r5l_log { /* protect this struct and log io */ struct mutex io_mutex; + spinlock_t io_list_lock; + struct list_head io_list; + /* r5log device */ struct btrfs_device *dev; @@ -1205,6 +1208,7 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *l static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) { __free_page(io->meta_page); + ASSERT(list_empty(&io->list)); kfree(io); } @@ -1225,6 +1229,27 @@ static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_ io->need_split_bio = true; } +/* the IO order is maintained in log->io_list.
*/ +static void btrfs_r5l_finish_io(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io, *next; + + spin_lock(&log->io_list_lock); + list_for_each_entry_safe(io, next, &log->io_list, list) { + if (io->status != BTRFS_R5L_STRIPE_END) + break; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); +#endif + + list_del_init(&io->list); + log->next_checkpoint = io->log_start; + btrfs_r5l_free_io_unit(log, io); + } + spin_unlock(&log->io_list_lock); +} + static void btrfs_write_rbio(struct btrfs_raid_bio *rbio); static void btrfs_r5l_log_endio(struct bio *bio) @@ -1234,18 +1259,12 @@ static void btrfs_r5l_log_endio(struct bio *bio) bio_put(bio); -#ifdef BTRFS_DEBUG_R5LOG - trace_printk("move data to disk(current log->next_checkpoint %llu (will be %llu after writing to RAID\n", log->next_checkpoint, io->log_start); -#endif /* move data to RAID. */ btrfs_write_rbio(io->rbio); + io->status = BTRFS_R5L_STRIPE_END; /* After stripe data has been flushed into raid, set ->next_checkpoint.
*/ - log->next_checkpoint = io->log_start; - - if (log->current_io == io) - log->current_io = NULL; - btrfs_r5l_free_io_unit(log, io); + btrfs_r5l_finish_io(log); } static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log) @@ -1299,6 +1318,11 @@ static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log) bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0); btrfs_r5l_reserve_log_entry(log, io); + + INIT_LIST_HEAD(&io->list); + spin_lock(&log->io_list_lock); + list_add_tail(&io->list, &log->io_list); + spin_unlock(&log->io_list_lock); return io; } @@ -3760,6 +3784,8 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, ASSERT(sizeof(device->uuid) == BTRFS_UUID_SIZE); log->uuid_csum = btrfs_crc32c(~0, device->uuid, sizeof(device->uuid)); mutex_init(&log->io_mutex); + spin_lock_init(&log->io_list_lock); + INIT_LIST_HEAD(&log->io_list); return log; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index 2cc64a3..fc4ff20 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -43,11 +43,16 @@ static inline int nr_data_stripes(struct map_lookup *map) struct btrfs_r5l_log; #define BTRFS_R5LOG_MAGIC 0x6433c509 +#define BTRFS_R5L_STRIPE_END 1 + /* one meta block + several data + parity blocks */ struct btrfs_r5l_io_unit { struct btrfs_r5l_log *log; struct btrfs_raid_bio *rbio; + struct list_head list; + int status; + /* store meta block */ struct page *meta_page; -- 2.9.4
[PATCH 1/2] Btrfs-progs: add option to add raid5/6 log device
This introduces an option for 'btrfs device add' to add a device as raid5/6 log at run time. Signed-off-by: Liu Bo--- cmds-device.c | 30 +- ioctl.h | 3 +++ 2 files changed, 28 insertions(+), 5 deletions(-) diff --git a/cmds-device.c b/cmds-device.c index 4337eb2..ec6037e 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -45,6 +45,7 @@ static const char * const cmd_device_add_usage[] = { "Add a device to a filesystem", "-K|--nodiscard    do not perform whole device TRIM", "-f|--force        force overwrite existing filesystem on the disk", + "-L|--r5log        add a disk as raid56 log", NULL }; @@ -55,6 +56,7 @@ static int cmd_device_add(int argc, char **argv) DIR *dirstream = NULL; int discard = 1; int force = 0; + int for_r5log = 0; int last_dev; while (1) { @@ -62,10 +64,11 @@ static int cmd_device_add(int argc, char **argv) static const struct option long_options[] = { { "nodiscard", optional_argument, NULL, 'K'}, { "force", no_argument, NULL, 'f'}, + { "r5log", no_argument, NULL, 'L'}, { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "Kf", long_options, NULL); + c = getopt_long(argc, argv, "KfL", long_options, NULL); if (c < 0) break; switch (c) { @@ -75,6 +78,9 @@ static int cmd_device_add(int argc, char **argv) case 'f': force = 1; break; + case 'L': + for_r5log = 1; + break; default: usage(cmd_device_add_usage); } @@ -83,6 +89,9 @@ static int cmd_device_add(int argc, char **argv) if (check_argc_min(argc - optind, 2)) usage(cmd_device_add_usage); + if (for_r5log && check_argc_max(argc - optind, 2)) + usage(cmd_device_add_usage); + last_dev = argc - 1; mntpnt = argv[last_dev]; @@ -91,7 +100,6 @@ static int cmd_device_add(int argc, char **argv) return 1; for (i = optind; i < last_dev; i++){ - struct btrfs_ioctl_vol_args ioctl_args; int devfd, res; u64 dev_block_count = 0; char *path; @@ -126,9 +134,21 @@ static int cmd_device_add(int argc, char **argv) goto error_out; } - memset(&ioctl_args, 0, sizeof(ioctl_args)); - strncpy_null(ioctl_args.name, path); - res = ioctl(fdmnt, 
BTRFS_IOC_ADD_DEV, &ioctl_args); + if (!for_r5log) { + struct btrfs_ioctl_vol_args ioctl_args; + + memset(&ioctl_args, 0, sizeof(ioctl_args)); + strncpy_null(ioctl_args.name, path); + res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV, &ioctl_args); + } else { + /* apply v2 args format */ + struct btrfs_ioctl_vol_args_v2 ioctl_args; + + memset(&ioctl_args, 0, sizeof(ioctl_args)); + strncpy_null(ioctl_args.name, path); + ioctl_args.flags |= BTRFS_DEVICE_RAID56_LOG; + res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV_V2, &ioctl_args); + } if (res < 0) { error("error adding device '%s': %s", path, strerror(errno)); diff --git a/ioctl.h b/ioctl.h index 709e996..748a7af 100644 --- a/ioctl.h +++ b/ioctl.h @@ -53,6 +53,7 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096); #define BTRFS_SUBVOL_RDONLY (1ULL << 1) #define BTRFS_SUBVOL_QGROUP_INHERIT (1ULL << 2) #define BTRFS_DEVICE_SPEC_BY_ID (1ULL << 3) +#define BTRFS_DEVICE_RAID56_LOG (1ULL << 4) #define BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED \ (BTRFS_SUBVOL_CREATE_ASYNC |\ @@ -828,6 +829,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_feature_flags[3]) #define BTRFS_IOC_RM_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 58, \ struct btrfs_ioctl_vol_args_v2) +#define BTRFS_IOC_ADD_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 59, \ + struct btrfs_ioctl_vol_args_v2) #ifdef __cplusplus } #endif -- 2.5.0
[PATCH 07/14] Btrfs: raid56: load r5log
A raid5/6 log can be loaded either while mounting a btrfs filesystem which already has a disk set up as the raid5/6 log, or when setting up a disk as the raid5/6 log for the first time. It gets %journal_tail from the super_block, reads the first 4K block at that position and runs sanity checks on it; if the block is valid, it then checks whether anything needs to be replayed, otherwise it creates a new empty block at the beginning of the disk and new writes will append to it. Signed-off-by: Liu Bo--- fs/btrfs/disk-io.c | 16 +++ fs/btrfs/raid56.c | 128 + fs/btrfs/raid56.h | 1 + 3 files changed, 145 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8685d67..c2d8697 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2987,6 +2987,22 @@ int open_ctree(struct super_block *sb, fs_info->generation = generation; fs_info->last_trans_committed = generation; + if (fs_info->r5log) { + u64 cp = btrfs_super_journal_tail(fs_info->super_copy); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: get journal_tail %llu\n", __func__, cp); +#endif + /* if the data is not replayed, data and parity on +* disk are still consistent. So we can move on. +* +* About fsync, since fsync can make sure data is +* flushed onto disk and only metadata is kept into +* write-ahead log, the fsync'd data will never end +* up being replayed by the raid56 log. 
+*/ + btrfs_r5l_load_log(fs_info, cp); + } + ret = btrfs_recover_balance(fs_info); if (ret) { btrfs_err(fs_info, "failed to recover balance: %d", ret); diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 60010a6..5d7ea235 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1477,6 +1477,134 @@ static bool btrfs_r5l_has_free_space(struct btrfs_r5l_log *log, u64 size) return log->device_size > (used_size + size); } +static int btrfs_r5l_sync_page_io(struct btrfs_r5l_log *log, + struct btrfs_device *dev, sector_t sector, + int size, struct page *page, int op) +{ + struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, 1); + int ret; + + bio->bi_bdev = dev->bdev; + bio->bi_opf = op; + if (dev == log->dev) + bio->bi_iter.bi_sector = (log->data_offset >> 9) + sector; + else + bio->bi_iter.bi_sector = sector; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: op %d bi_sector 0x%llx\n", __func__, op, (bio->bi_iter.bi_sector << 9)); +#endif + + bio_add_page(bio, page, size, 0); + submit_bio_wait(bio); + ret = !bio->bi_error; + bio_put(bio); + return ret; +} + +static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, u64 seq) +{ + struct page *page; + struct btrfs_r5l_meta_block *mb; + int ret = 0; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: pos %llu seq %llu\n", __func__, pos, seq); +#endif + + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO); + ASSERT(page); + + mb = kmap(page); + mb->magic = cpu_to_le32(BTRFS_R5LOG_MAGIC); + mb->meta_size = cpu_to_le32(sizeof(struct btrfs_r5l_meta_block)); + mb->seq = cpu_to_le64(seq); + mb->position = cpu_to_le64(pos); + kunmap(page); + + if (!btrfs_r5l_sync_page_io(log, log->dev, (pos >> 9), PAGE_SIZE, page, REQ_OP_WRITE | REQ_FUA)) { + ret = -EIO; + } + + __free_page(page); + return ret; +} + +static void btrfs_r5l_write_super(struct btrfs_fs_info *fs_info, u64 cp); + +static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) +{ + return 0; +} + +/* return 0 if success, otherwise return errors */ 
+int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp) +{ + struct btrfs_r5l_log *log = fs_info->r5log; + struct page *page; + struct btrfs_r5l_meta_block *mb; + bool create_new = false; + + ASSERT(log); + + page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + ASSERT(page); + + if (!btrfs_r5l_sync_page_io(log, log->dev, (cp >> 9), PAGE_SIZE, page, + REQ_OP_READ)) { + __free_page(page); + return -EIO; + } + + mb = kmap(page); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("r5l: mb->pos %llu cp %llu mb->seq %llu\n", le64_to_cpu(mb->position), cp, le64_to_cpu(mb->seq)); +#endif + + if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC) { +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("magic not match: create new r5l\n"); +#endif + create_new = true; + goto create; + } + + ASSERT(le64_to_cpu(mb->position) == cp); + if (le64_to_cpu(mb->position) != cp) { +#ifdef
[PATCH 05/14] Btrfs: raid56: add stripe log for raid5/6
This adds the ability to use a disk as the raid5/6 stripe log (aka journal); the primary goal is to fix the write hole issue that is inherent in a raid56 setup. In a typical raid5/6 setup, both a full stripe write and a partial stripe write generate parity at the very end of writing, so once parity is generated it's the right time to issue the writes. Now, with the raid5/6 stripe log, every write is put into the stripe log prior to being written to the raid5/6 array, so that we have everything needed to rewrite any 'not-yet-on-disk' data/parity if a power loss happens while writing data/parity to the different disks in the raid5/6 array. A metadata block is used to manage the information about data and parity, and it's placed ahead of the data and parity on the stripe log. Right now such a metadata block is limited to one page in size and the structure is defined as {metadata block} + {a few payloads} - 'metadata block' contains a magic code, a sequence number and the start position on the stripe log. - 'payload' contains the information about data and parity, e.g. the physical offset and device id where data/parity is supposed to be. Each data block has its own payload, while each set of parity has one payload (e.g. for raid6, parity p and q each have their own payload). We treat data and parity differently because btrfs always prepares the whole stripe length (64k) of parity, but data may come from only a partial stripe write. The metadata block is written to the raid5/6 stripe log together with data/parity in a single bio (it may be split into two bios; more than two is not supported). 
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 512 +++--- fs/btrfs/raid56.h | 65 +++ 2 files changed, 513 insertions(+), 64 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c75766f..007ba63 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -185,6 +185,8 @@ struct btrfs_r5l_log { /* r5log device */ struct btrfs_device *dev; + struct btrfs_fs_info *fs_info; + /* allocation range for log entries */ u64 data_offset; u64 device_size; @@ -1179,6 +1181,445 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio) spin_unlock_irq(&rbio->bio_list_lock); } +/* r5log */ +/* XXX: this allocation may be done earlier, eg. when allocating rbio */ +static struct btrfs_r5l_io_unit *btrfs_r5l_alloc_io_unit(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io; + gfp_t gfp = GFP_NOFS; + + io = kzalloc(sizeof(*io), gfp); + ASSERT(io); + io->log = log; + /* need to use kmap. */ + io->meta_page = alloc_page(gfp | __GFP_HIGHMEM | __GFP_ZERO); + ASSERT(io->meta_page); + + return io; +} + +static void btrfs_r5l_free_io_unit(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) +{ + __free_page(io->meta_page); + kfree(io); +} + +static u64 btrfs_r5l_ring_add(struct btrfs_r5l_log *log, u64 start, u64 inc) +{ + start += inc; + if (start >= log->device_size) + start = start - log->device_size; + return start; +} + +static void btrfs_r5l_reserve_log_entry(struct btrfs_r5l_log *log, struct btrfs_r5l_io_unit *io) +{ + log->log_start = btrfs_r5l_ring_add(log, log->log_start, PAGE_SIZE); + io->log_end = log->log_start; + + if (log->log_start == 0) + io->need_split_bio = true; +} + +static void btrfs_write_rbio(struct btrfs_raid_bio *rbio); + +static void btrfs_r5l_log_endio(struct bio *bio) +{ + struct btrfs_r5l_io_unit *io = bio->bi_private; + struct btrfs_r5l_log *log = io->log; + + bio_put(bio); + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("move data to disk\n"); +#endif + /* move data to RAID. 
*/ + btrfs_write_rbio(io->rbio); + + if (log->current_io == io) + log->current_io = NULL; + btrfs_r5l_free_io_unit(log, io); +} + +static struct bio *btrfs_r5l_bio_alloc(struct btrfs_r5l_log *log) +{ + /* this allocation will not fail. */ + struct bio *bio = btrfs_io_bio_alloc(GFP_NOFS, BIO_MAX_PAGES); + + /* We need to make sure data/parity are settled down on the log disk. */ + bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA; + bio->bi_bdev = log->dev->bdev; + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("log->data_offset 0x%llx log->log_start 0x%llx\n", log->data_offset, log->log_start); +#endif + bio->bi_iter.bi_sector = (log->data_offset + log->log_start) >> 9; + + return bio; +} + +static struct btrfs_r5l_io_unit *btrfs_r5l_new_meta(struct btrfs_r5l_log *log) +{ + struct btrfs_r5l_io_unit *io; + struct btrfs_r5l_meta_block *block; + + io = btrfs_r5l_alloc_io_unit(log); + ASSERT(io); + + block = kmap(io->meta_page); + clear_page(block); + +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s pos %llu seq %llu\n", __func__, log->log_start,
[PATCH 08/14] Btrfs: raid56: log recovery
This is adding recovery on raid5/6 log. We've set a %journal_tail in super_block, which indicates the position from where we need to replay data. So we scan the log and replay valid meta/data/parity pairs until finding an invalid one. By replaying, it simply reads data/parity from the raid5/6 log and issues writes to the raid disks where it should be. Please note that the whole meta/data/parity pair can be discarded if it fails the sanity check in the meta block. After recovery, we also append an empty meta block and update the %journal_tail in super_block in order to avoid a situation, where the layout on the raid5/6 log is [valid A][invalid B][valid C], so block A is the only one we should replay. Then the recovery ends up pointing to block A as block B is invalid, and some new writes come in and append to block A so that block B is now overwritten to be a valid meta/data/parity. If a power loss happens, the new recovery starts again from block A, and since block B is now valid, it may replay block C as well which has become stale. 
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 151 ++ 1 file changed, 151 insertions(+) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 5d7ea235..dea33c4 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1530,10 +1530,161 @@ static int btrfs_r5l_write_empty_meta_block(struct btrfs_r5l_log *log, u64 pos, return ret; } +struct btrfs_r5l_recover_ctx { + u64 pos; + u64 seq; + u64 total_size; + struct page *meta_page; + struct page *io_page; +}; + +static int btrfs_r5l_recover_load_meta(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +{ + struct btrfs_r5l_meta_block *mb; + + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos >> 9), PAGE_SIZE, ctx->meta_page, REQ_OP_READ); + + mb = kmap(ctx->meta_page); +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("ctx->pos %llu ctx->seq %llu pos %llu seq %llu\n", ctx->pos, ctx->seq, le64_to_cpu(mb->position), le64_to_cpu(mb->seq)); +#endif + + if (le32_to_cpu(mb->magic) != BTRFS_R5LOG_MAGIC || + le64_to_cpu(mb->position) != ctx->pos || + le64_to_cpu(mb->seq) != ctx->seq) { +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("%s: mismatch magic %llu default %llu\n", __func__, le32_to_cpu(mb->magic), BTRFS_R5LOG_MAGIC); +#endif + return -EINVAL; + } + + ASSERT(le32_to_cpu(mb->meta_size) <= PAGE_SIZE); + kunmap(ctx->meta_page); + + /* meta_block */ + ctx->total_size = PAGE_SIZE; + + return 0; +} + +static int btrfs_r5l_recover_load_data(struct btrfs_r5l_log *log, struct btrfs_r5l_recover_ctx *ctx) +{ + u64 offset; + struct btrfs_r5l_meta_block *mb; + u64 meta_size; + u64 io_offset; + struct btrfs_device *dev; + + mb = kmap(ctx->meta_page); + + io_offset = PAGE_SIZE; + offset = sizeof(struct btrfs_r5l_meta_block); + meta_size = le32_to_cpu(mb->meta_size); + + while (offset < meta_size) { + struct btrfs_r5l_payload *payload = (void *)mb + offset; + + /* read data from log disk and write to payload->location */ +#ifdef BTRFS_DEBUG_R5LOG + trace_printk("payload type %d flags %d size %d location 0x%llx devid %llu\n", 
le16_to_cpu(payload->type), le16_to_cpu(payload->flags), le32_to_cpu(payload->size), le64_to_cpu(payload->location), le64_to_cpu(payload->devid)); +#endif + + dev = btrfs_find_device(log->fs_info, le64_to_cpu(payload->devid), NULL, NULL); + if (!dev || dev->missing) { + ASSERT(0); + } + + if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_DATA) { + ASSERT(le32_to_cpu(payload->size) == 1); + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos + io_offset) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_READ); + btrfs_r5l_sync_page_io(log, dev, le64_to_cpu(payload->location) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_WRITE); + io_offset += PAGE_SIZE; + } else if (le16_to_cpu(payload->type) == R5LOG_PAYLOAD_PARITY) { + int i; + ASSERT(le32_to_cpu(payload->size) == 16); + for (i = 0; i < le32_to_cpu(payload->size); i++) { + /* liubo: parity are guaranteed to be +* contiguous, use just one bio to +* hold all pages and flush them. */ + u64 parity_off = le64_to_cpu(payload->location) + i * PAGE_SIZE; + btrfs_r5l_sync_page_io(log, log->dev, (ctx->pos + io_offset) >> 9, PAGE_SIZE, ctx->io_page, REQ_OP_READ); + btrfs_r5l_sync_page_io(log, dev, parity_off >> 9, PAGE_SIZE, ctx->io_page,
[PATCH 04/14] Btrfs: raid56: add verbose debug
Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 2 ++ fs/btrfs/volumes.c | 7 ++- fs/btrfs/volumes.h | 4 ++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 2b91b95..c75766f 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2753,7 +2753,9 @@ int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device) cmpxchg(&fs_info->r5log, NULL, log); ASSERT(fs_info->r5log == log); +#ifdef BTRFS_DEBUG_R5LOG trace_printk("r5log: set a r5log in fs_info, alloc_range 0x%llx 0x%llx", log->data_offset, log->data_offset + log->device_size); +#endif return 0; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a17a488..ac64d93 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4731,8 +4731,13 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, if (!device->in_fs_metadata || device->is_tgtdev_for_dev_replace || - (device->type & BTRFS_DEV_RAID56_LOG)) + (device->type & BTRFS_DEV_RAID56_LOG)) { +#ifdef BTRFS_DEBUG_R5LOG + if (device->type & BTRFS_DEV_RAID56_LOG) + btrfs_info(info, "skip a r5log when alloc chunk\n"); +#endif continue; + } if (device->total_bytes > device->bytes_used) total_avail = device->total_bytes - device->bytes_used; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 60e347a..44cc3fa 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -26,6 +26,10 @@ extern struct mutex uuid_mutex; +#ifdef CONFIG_BTRFS_DEBUG +#define BTRFS_DEBUG_R5LOG +#endif + #define BTRFS_STRIPE_LEN SZ_64K struct buffer_head; -- 2.9.4
[PATCH 2/2] Btrfs-progs: introduce super_journal_tail to inspect-dump-super
We've recorded the journal_tail of the raid5/6 log in the super_block so that raid5/6 log recovery can scan from this position. This teaches inspect-dump-super to print %journal_tail. Signed-off-by: Liu Bo--- cmds-inspect-dump-super.c | 2 ++ ctree.h | 6 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c index 98e0270..baa4d1a 100644 --- a/cmds-inspect-dump-super.c +++ b/cmds-inspect-dump-super.c @@ -389,6 +389,8 @@ static void dump_superblock(struct btrfs_super_block *sb, int full) (unsigned long long)btrfs_super_log_root_transid(sb)); printf("log_root_level\t\t%llu\n", (unsigned long long)btrfs_super_log_root_level(sb)); + printf("journal_tail\t\t%llu\n", + (unsigned long long)btrfs_super_journal_tail(sb)); printf("total_bytes\t\t%llu\n", (unsigned long long)btrfs_super_total_bytes(sb)); printf("bytes_used\t\t%llu\n", diff --git a/ctree.h b/ctree.h index 48ae890..d28d6f7 100644 --- a/ctree.h +++ b/ctree.h @@ -458,8 +458,10 @@ struct btrfs_super_block { __le64 cache_generation; __le64 uuid_tree_generation; + __le64 journal_tail; + /* future expansion */ - __le64 reserved[30]; + __le64 reserved[29]; u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE]; struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS]; } __attribute__ ((__packed__)); @@ -2143,6 +2145,8 @@ BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); +BTRFS_SETGET_STACK_FUNCS(super_journal_tail, struct btrfs_super_block, +journal_tail, 64); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, total_bytes, 64); BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block, -- 2.5.0
[PATCH 13/14] Btrfs: raid56: initialize raid5/6 log after adding it
We need to initialize the raid5/6 log after adding it, but we don't want to race with concurrent writes. So we initialize it before assigning the log pointer in %fs_info. Signed-off-by: Liu Bo--- fs/btrfs/disk-io.c | 2 +- fs/btrfs/raid56.c | 18 -- fs/btrfs/raid56.h | 3 ++- fs/btrfs/volumes.c | 2 ++ 4 files changed, 21 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c2d8697..3fbd347 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3000,7 +3000,7 @@ int open_ctree(struct super_block *sb, * write-ahead log, the fsync'd data will never ends * up with being replayed by raid56 log. */ - btrfs_r5l_load_log(fs_info, cp); + btrfs_r5l_load_log(fs_info, NULL, cp); } ret = btrfs_recover_balance(fs_info); diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 0bfc97a..b771d7d 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1943,14 +1943,28 @@ static int btrfs_r5l_recover_log(struct btrfs_r5l_log *log) } /* return 0 if success, otherwise return errors */ -int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp) +int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *r5log, u64 cp) { - struct btrfs_r5l_log *log = fs_info->r5log; + struct btrfs_r5l_log *log; struct page *page; struct btrfs_r5l_meta_block *mb; bool create_new = false; int ret; + if (r5log) + ASSERT(fs_info->r5log == NULL); + if (fs_info->r5log) + ASSERT(r5log == NULL); + + if (fs_info->r5log) + log = fs_info->r5log; + else + /* +* this only happens when adding the raid56 log for +* the first time. 
+*/ + log = r5log; + ASSERT(log); page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index f6d6f36..2cc64a3 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -140,5 +140,6 @@ struct btrfs_r5l_log * btrfs_r5l_init_log_prepare(struct btrfs_fs_info *fs_info, void btrfs_r5l_init_log_post(struct btrfs_fs_info *fs_info, struct btrfs_r5l_log *log); int btrfs_set_r5log(struct btrfs_fs_info *fs_info, struct btrfs_device *device); -int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, u64 cp); +int btrfs_r5l_load_log(struct btrfs_fs_info *fs_info, + struct btrfs_r5l_log *r5log, u64 cp); #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 851c001..7f848d7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2521,6 +2521,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path } if (is_r5log) { + /* initialize r5log with cp == 0. */ + btrfs_r5l_load_log(fs_info, r5log, 0); btrfs_r5l_init_log_post(fs_info, r5log); } -- 2.9.4
[PATCH 03/14] Btrfs: raid56: detect raid56 log on mount
We've put the flag BTRFS_DEV_RAID56_LOG in device->type, so we can recognize the journal device of raid56 while reading the chunk tree. Signed-off-by: Liu Bo--- fs/btrfs/volumes.c | 12 1 file changed, 12 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 5c50df7..a17a488 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6696,6 +6696,18 @@ static int read_one_dev(struct btrfs_fs_info *fs_info, } fill_device_from_item(leaf, dev_item, device); + + if (device->type & BTRFS_DEV_RAID56_LOG) { + ret = btrfs_set_r5log(fs_info, device); + if (ret) { + btrfs_err(fs_info, "error %d on loading r5log", ret); + return ret; + } + + btrfs_info(fs_info, "devid %llu uuid %pU is raid56 log", + device->devid, device->uuid); + } + device->in_fs_metadata = 1; if (device->writeable && !device->is_tgtdev_for_dev_replace) { device->fs_devices->total_rw_bytes += device->total_bytes; -- 2.9.4
[PATCH 02/14] Btrfs: raid56: do not allocate chunk on raid56 log
The journal device (aka raid56 log) is not for chunk allocation, so let's skip it. Signed-off-by: Liu Bo--- fs/btrfs/volumes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index dafc541..5c50df7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4730,7 +4730,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, } if (!device->in_fs_metadata || - device->is_tgtdev_for_dev_replace) + device->is_tgtdev_for_dev_replace || + (device->type & BTRFS_DEV_RAID56_LOG)) continue; if (device->total_bytes > device->bytes_used) -- 2.9.4
[PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
This aims to fix the write hole issue on a btrfs raid5/6 setup by adding a separate disk as a journal (aka the raid5/6 log), so that after an unclean shutdown we can make sure data and parity are consistent on the raid array by replaying the journal. The idea and the code are similar to the write-through mode of the md raid5-cache, so PPL (partial parity log) is also feasible to implement. (If you're familiar with md, you may find this patch set boring to read...) Patches 1-3 are about adding a log disk, patches 5-8 are the main part of the implementation, and the remaining patches are improvements and bugfixes, e.g. readahead for recovery and checksums. Two btrfs-progs patches are required to play with this patch set: one enhances 'btrfs device add' to add a disk as the raid5/6 log with the option '-L', the other teaches 'btrfs-show-super' to show %journal_tail. This is currently based on 4.12-rc3. The patch set is tagged with RFC, and comments are always welcome, thanks. Known limitations: - Deleting a log device is not implemented yet. 
Liu Bo (14): Btrfs: raid56: add raid56 log via add_dev v2 ioctl Btrfs: raid56: do not allocate chunk on raid56 log Btrfs: raid56: detect raid56 log on mount Btrfs: raid56: add verbose debug Btrfs: raid56: add stripe log for raid5/6 Btrfs: raid56: add reclaim support Btrfs: raid56: load r5log Btrfs: raid56: log recovery Btrfs: raid56: add readahead for recovery Btrfs: raid56: use the readahead helper to get page Btrfs: raid56: add csum support Btrfs: raid56: fix error handling while adding a log device Btrfs: raid56: initialize raid5/6 log after adding it Btrfs: raid56: maintain IO order on raid5/6 log fs/btrfs/ctree.h| 16 +- fs/btrfs/disk-io.c | 16 + fs/btrfs/ioctl.c| 48 +- fs/btrfs/raid56.c | 1429 ++- fs/btrfs/raid56.h | 82 +++ fs/btrfs/transaction.c |2 + fs/btrfs/volumes.c | 56 +- fs/btrfs/volumes.h |7 +- include/uapi/linux/btrfs.h |3 + include/uapi/linux/btrfs_tree.h |4 + 10 files changed, 1487 insertions(+), 176 deletions(-) -- 2.9.4
[PATCH 11/14] Btrfs: raid56: add csum support
This adds checksums for the meta/data/parity resident on the raid5/6 log, so recovery can now verify checksums to see whether anything inside meta/data/parity has been changed. If anything is wrong in a meta block, we stop replaying data/parity at that position; if anything is wrong in a data/parity block, we just skip that meta/data/parity pair and move on to the next one. Signed-off-by: Liu Bo--- fs/btrfs/raid56.c | 235 -- fs/btrfs/raid56.h | 4 + 2 files changed, 197 insertions(+), 42 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 8f47e56..8bc7ba4 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -43,6 +43,7 @@ #include "async-thread.h" #include "check-integrity.h" #include "rcu-string.h" +#include "hash.h" /* set when additional merges to this rbio are not allowed */ #define RBIO_RMW_LOCKED_BIT 1 @@ -197,6 +198,7 @@ struct btrfs_r5l_log { u64 last_cp_seq; u64 seq; u64 log_start; + u32 uuid_csum; struct btrfs_r5l_io_unit *current_io; }; @@ -1309,7 +1311,7 @@ static int btrfs_r5l_get_meta(struct btrfs_r5l_log *log, struct btrfs_raid_bio * return 0; } -static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u64 location, u64 devid) +static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u64 location, u64 devid, u32 csum) { struct btrfs_r5l_io_unit *io = log->current_io; struct btrfs_r5l_payload *payload; @@ -1326,11 +1328,11 @@ static void btrfs_r5l_append_payload_meta(struct btrfs_r5l_log *log, u16 type, u payload->size = cpu_to_le32(16); /* stripe_len / PAGE_SIZE */ payload->devid = cpu_to_le64(devid); payload->location = cpu_to_le64(location); + payload->csum = cpu_to_le32(csum); kunmap(io->meta_page); - /* XXX: add checksum later */ io->meta_offset += sizeof(*payload); - //io->meta_offset += sizeof(__le32); + #ifdef BTRFS_DEBUG_R5LOG trace_printk("io->meta_offset %d\n", io->meta_offset); #endif @@ -1380,6 +1382,10 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int 
data_pages, int int meta_size; int stripe, pagenr; struct page *page; + char *kaddr; + u32 csum; + u64 location; + u64 devid; /* * parity pages are contiguous on disk, thus only one @@ -1394,8 +1400,6 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int /* add data blocks which need to be written */ for (stripe = 0; stripe < rbio->nr_data; stripe++) { for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) { - u64 location; - u64 devid; if (stripe < rbio->nr_data) { page = page_in_rbio(rbio, stripe, pagenr, 1); if (!page) @@ -1406,7 +1410,11 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int #ifdef BTRFS_DEBUG_R5LOG trace_printk("data: stripe %d pagenr %d location 0x%llx devid %llu\n", stripe, pagenr, location, devid); #endif - btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA, location, devid); + kaddr = kmap(page); + csum = btrfs_crc32c(log->uuid_csum, kaddr, PAGE_SIZE); + kunmap(page); + + btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA, location, devid, csum); btrfs_r5l_append_payload_page(log, page); } } @@ -1414,17 +1422,26 @@ static void btrfs_r5l_log_stripe(struct btrfs_r5l_log *log, int data_pages, int /* add the whole parity blocks */ for (; stripe < rbio->real_stripes; stripe++) { - u64 location = btrfs_compute_location(rbio, stripe, 0); - u64 devid = btrfs_compute_devid(rbio, stripe); + location = btrfs_compute_location(rbio, stripe, 0); + devid = btrfs_compute_devid(rbio, stripe); #ifdef BTRFS_DEBUG_R5LOG trace_printk("parity: stripe %d location 0x%llx devid %llu\n", stripe, location, devid); #endif - btrfs_r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY, location, devid); for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) { page = rbio_stripe_page(rbio, stripe, pagenr); + + kaddr = kmap(page); + if (pagenr == 0) + csum = btrfs_crc32c(log->uuid_csum, kaddr, PAGE_SIZE); + else + csum = btrfs_crc32c(csum, kaddr, PAGE_SIZE); + kunmap(page);
Re: Massive loss of disk space
On 2017-08-01 12:50, pwm wrote: I did a temporary patch of the snapraid code to start fallocate() from the previous parity file size. Like I said though, it's BTRFS that's misbehaving here, not snapraid. I'm going to try to get some further discussion about this here on the mailing list, and hopefully it will get fixed in BTRFS (I would try to do so myself, but I'm at best a novice at C, and not well versed in kernel code). Finally have a snapraid sync up and running. Looks good, but will take quite a while before I can try a scrub command to double-check everything. Thanks for the help. Glad I could be helpful! /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <fcntl.h> #include <unistd.h> #include <sys/stat.h> int main() { int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR); if (fd < 0) { printf("Failed opening parity file [%s]\n",strerror(errno)); return 1; } off_t filesize = 5151751667712ull; int res; struct stat statbuf; if (fstat(fd, &statbuf)) { printf("Failed stat [%s]\n",strerror(errno)); close(fd); return 1; } printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size); printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize); res = fallocate(fd,0,0,filesize); if (res) { printf("Failed fallocate [%s]\n",strerror(errno)); close(fd); return 1; } if (fsync(fd)) { printf("Failed fsync [%s]\n",strerror(errno)); close(fd); return 1; } close(fd); return 0; } So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. 
That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. 
Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_
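The workaround pwm applied (fallocate() only from the old size) can be sketched as a small helper; the name and error handling are hypothetical, but the fallocate(fd, 0, old_size, new_size - old_size) call is exactly the one discussed in the thread:

```c
/* Sketch (not the snapraid code): extend a file by fallocate()ing only
 * the range past its current size, so the filesystem is never asked to
 * re-allocate already-written extents. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static int extend_file(const char *path, off_t new_size)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st)) {
        close(fd);
        return -1;
    }

    int ret = 0;
    /* Only request the tail: offset = old size, len = growth. */
    if (new_size > st.st_size)
        ret = fallocate(fd, 0, st.st_size, new_size - st.st_size);

    if (fsync(fd))
        ret = -1;
    close(fd);
    return ret;
}
```

Because fallocate() without FALLOC_FL_KEEP_SIZE extends the file size to offset + len, this grows the file just like the whole-range call, but never re-covers the existing extents.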
Re: Massive loss of disk space
I did a temporary patch of the snapraid code to start fallocate() from the previous parity file size. Finally have a snapraid sync up and running. Looks good, but will take quite a while before I can try a scrub command to double-check everything. Thanks for the help. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). 
Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate,
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
2017-08-01 0:39 GMT+03:00 Ivan Sizov: > 2017-08-01 0:17 GMT+03:00 Marc MERLIN : >> On Tue, Aug 01, 2017 at 12:07:14AM +0300, Ivan Sizov wrote: >>> 2017-07-09 10:57 GMT+03:00 Martin Steigerwald : >>> > Hello Marc. >>> > >>> > Marc MERLIN - 08.07.17, 21:34: >>> >> Sigh, >>> >> >>> >> This is now the 3rd filesystem I have (on 3 different machines) that is >>> >> getting corruption of some kind (on 4.11.6). >>> > >>> > Anyone else getting corruptions with 4.11? >>> Yes, a lot. There are at least 3 cases, probably I've missed something. >>> https://www.spinics.net/lists/linux-btrfs/msg67177.html >>> https://www.spinics.net/lists/linux-btrfs/msg67681.html >>> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing/369275 >> >> Indeed. My main server is happy back on 4.9.36 and while my laptop is >> stuck on 4.11 due to other kernel issues that prevent me from going back >> to 4.9, it only corrupted a single filesystem so far, and no other ones >> that I've noticed yet. >> Hopefully that will hold :-/ >> >> Marc >> -- >> "A mouse is a device used to point at the xterm you want to type in" - A.S.R. >> Microsoft is to operating systems >> what McDonalds is to gourmet >> cooking >> Home page: http://marc.merlins.org/ | PGP >> 1024R/763BE901 > > I want to try mounting and checking FS under Live images with > different kernels tomorrow. Today's Fedora Rawhide image seems to be > built incorrectly. Can you advice me where to get a fresh live image > with 4.12 kernel (it's not important which distro that will be)? > > -- > Ivan Sizov Mounting problem persists: on 4.13.0 with btrfs-progs v4.11.1 (latest Fedora Rawhide Live) on 4.10.0 with btrfs-progs v4.9.1 (Ubuntu 17.04 Live) on 4.9.0 with btrfs-progs v 4.7.3 (Debian 9 Stretch Live) "btrfs check --readonly" also gives the same output on 4.11, 4.10 and 4.9. Marc, how did you roll back and fix those errors? 
-- Ivan Sizov -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Massive loss of disk space
On 2017-08-01 11:24, pwm wrote: Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly. That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) works in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. 
Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool,
Re: [PATCH] btrfs: copy fsid to super_block s_uuid
On Tue, Aug 01, 2017 at 06:35:08PM +0800, Anand Jain wrote:
> We didn't copy fsid to struct super_block.s_uuid so Overlay disables
> index feature with btrfs as the lower FS.
>
> kernel: overlayfs: fs on '/lower' does not support file handles, falling back
> to index=off.
>
> Fix this by publishing the fsid through struct super_block.s_uuid.
>
> Signed-off-by: Anand Jain
> ---
> I tried to know if in case did we deliberately missed this for some reason,
> however there is no information on that. If we mount a non-default subvol in
> the next mount/remount, its still the same FS, so publishing the FSID
> instead of subvol uuid is correct, OR I can't think any other reason for
> not using s_uuid for btrfs.
>
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 080e2ebb8aa0..b7e72d040442 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2899,6 +2899,7 @@ int open_ctree(struct super_block *sb,
>
>  	sb->s_blocksize = sectorsize;
>  	sb->s_blocksize_bits = blksize_bits(sectorsize);
> +	memcpy(&sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);

uuid_copy()?

--D

>  	mutex_lock(&fs_info->chunk_mutex);
>  	ret = btrfs_read_sys_array(fs_info);
> --
> 2.13.1
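For context on the uuid_copy() suggestion: in the kernel, uuid_t is a 16-byte struct and uuid_copy() is just a fixed-size copy, so the change is about type safety rather than behavior. A userspace sketch of that equivalence (these definitions mirror, rather than reuse, the kernel's):

```c
#include <string.h>

#define UUID_SIZE 16

/* Userspace stand-in for the kernel's uuid_t: 16 raw bytes. */
typedef struct {
    unsigned char b[UUID_SIZE];
} uuid_t;

/* Stand-in for the kernel's uuid_copy(): a fixed 16-byte copy,
 * semantically the same bytes the memcpy() in the patch moves. */
static void uuid_copy(uuid_t *dst, const uuid_t *src)
{
    memcpy(dst, src, sizeof(uuid_t));
}
```

The migration Anand describes would change the type of btrfs_fs_info.fsid (and its 70-odd users) to uuid_t, after which the open_ctree() line becomes a single uuid_copy() call.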
[PATCH] btrfs: remove broken memory barrier
Commit 38851cc19adb ("Btrfs: implement unlocked dio write") implemented
unlocked dio write, allowing multiple dio writers to write to
non-overlapping, and non-eof-extending regions. In doing so it also
introduced a broken memory barrier. It is broken due to 2 things:

1. Memory barriers _MUST_ always be paired, this is clearly not the case
   here.
2. Checkpatch actually produces a warning if a memory barrier is
   introduced that doesn't have a comment explaining how it's being
   paired.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95c212037095..5e48d2c10152 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8731,7 +8731,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		return 0;
 
 	inode_dio_begin(inode);
-	smp_mb__after_atomic();
 
 	/*
 	 * The generic stuff only does filemap_write_and_wait_range, which
-- 
2.7.4
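As an aside on point 1, the pairing rule the commit message cites can be illustrated in userspace with C11 atomics: a release fence on the producer side pairs with an acquire fence on the consumer side, each carrying the comment checkpatch asks for. This is a minimal analogue, not btrfs code:

```c
#include <stdatomic.h>
#include <pthread.h>

static int payload;          /* plain data published via the flag */
static atomic_int ready;     /* synchronization flag, starts at 0 */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Pairs with the acquire fence in reader(): guarantees payload is
     * visible before the flag is. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;   /* spin until the writer publishes */
    /* Pairs with the release fence in writer(): the payload read below
     * cannot be reordered before the flag check. */
    atomic_thread_fence(memory_order_acquire);
    *(int *)arg = payload;
    return NULL;
}
```

An unpaired fence, like the smp_mb__after_atomic() being removed, orders nothing by itself; without a matching barrier on the other side there is no "other side" for the guarantee to apply to.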
Re: Massive loss of disk space
Yes, the test code is as below - trying to match what snapraid tries to do:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main() {
  int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
  if (fd < 0) {
    printf("Failed opening parity file [%s]\n", strerror(errno));
    return 1;
  }
  off_t filesize = 5151751667712ull;
  int res;
  struct stat statbuf;
  if (fstat(fd, &statbuf)) {
    printf("Failed stat [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  printf("Original file size is %llu bytes\n", (unsigned long long)statbuf.st_size);
  printf("Trying to grow file to %llu bytes\n", (unsigned long long)filesize);
  res = fallocate(fd, 0, 0, filesize);
  if (res) {
    printf("Failed fallocate [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  if (fsync(fd)) {
    printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 1;
  }
  close(fd);
  return 0;
}

So the call doesn't make use of the previous file size as offset for the extension. int fallocate(int fd, int mode, off_t offset, off_t len); What you are implying here is that if the fallocate() call is modified to: res = fallocate(fd,0,old_size,new_size-old_size); then everything should work as expected? /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. 
pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. 
Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption. 
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
On 8/1/17, Duncan <1i5t5.dun...@cox.net> wrote: > Imran Geriskovan posted on Mon, 31 Jul 2017 22:32:39 +0200 as excerpted: Now the init on /boot is a "19 lines" shell script, including lines for keymap, hdparm, crytpsetup. And let's not forget this is possible by a custom kernel and its reliable buddy syslinux. >>> And I'm using dracut for that, tho quite cut down from its default, >>> with a monolithic kernel and only installing necessary dracut modules. >> Just create minimal bootable /boot for running below init. >> (Your initramfs/rd is a bloated and packaged version of this anyway.) >> Kick the rest. Since you a have your own kernel you are not far away >> from it. > Thanks. You just solved my primary problem of needing to take the time > to actually research all the steps and in what order I needed to do them, > for a hand-rolled script. =:^) It's just a minimal one. But it is a good start. For possible extensions extract your initramfs and explore it. Dracut is bloated. Try mkinitcpio. Once your have your self hosting bootmng, kernel, modules, /boot, init, etc chain, you'll be shocked to realize you have been spending so much time for that bullshit while trying to keep them up.. Get to this point in the shortest possible time. Save your precious time. And reclaim your systems reliability. For X, you'll still need udev or eudev. 
Re: Massive loss of disk space
On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. 
root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. No issue at all to grow the parity file on the other parity disk. 
And that's why I wonder if there is some undetected file system corruption. 
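Austin's three manual steps can be condensed into a single C helper: write part of a file without fallocate, then request the whole range and see whether the filesystem charges again for the already-written portion. Path and sizes here are placeholders; on the affected BTRFS kernels the large-scale version of this returns ENOSPC, while on ext4/XFS it succeeds:

```c
#define _GNU_SOURCE
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write `written` bytes of zeroes (written should be a multiple of
 * 4096), then fallocate() the whole range [0, total). Returns 0 if the
 * whole-file fallocate succeeded, -errno if it failed - e.g. -ENOSPC
 * on the affected BTRFS kernels when `written` exceeds half the free
 * space. */
static int whole_file_fallocate_test(const char *path, off_t written,
                                     off_t total)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -errno;

    char block[4096];
    memset(block, 0, sizeof(block));
    for (off_t off = 0; off < written; off += (off_t)sizeof(block)) {
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
            int err = -errno;
            close(fd);
            return err;
        }
    }
    fsync(fd);

    /* The pattern from the thread: cover the full desired size,
     * including the range that is already written out. */
    int ret = fallocate(fd, 0, 0, total) ? -errno : 0;
    close(fd);
    return ret;
}
```

Running this with a `written` larger than half the free space on a BTRFS mount of the affected kernels should reproduce the ENOSPC that pwm hit, while the same call on ext4 or XFS (or with small sizes, as below) completes normally.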
Re: Massive loss of disk space
On 2017-08-01 10:39, pwm wrote: Thanks for the links and suggestions. I did try your suggestions but it didn't solve the underlying problem. pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 Dumping filters: flags 0x1, state 0x0, force is off DATA (flags 0x2): balancing, usage=20 Done, had to relocate 4596 out of 9317 chunks pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/ Done, had to relocate 2 out of 4721 chunks pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 Data, single: total=4.60TiB, used=4.59TiB System, DUP: total=40.00MiB, used=512.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 4.61TiB path /dev/sdg1 So now device 1 usage is down from 9.09TiB to 4.61TiB. But if I test to fallocate() to grow the large parity file, I directly fail. I wrote a little help program that just focuses on fallocate() instead of having to run snapraid with lots of unknown additional actions being performed. Original file size is 5050486226944 bytes Trying to grow file to 5151751667712 bytes Failed fallocate [No space left on device] And result after shows 'used' have jumped up to 9.09TiB again. root@europium:/mnt# btrfs fi show snap_04 Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 root@europium:/mnt# btrfs fi df /mnt/snap_04/ Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=40.00MiB, used=992.00KiB Metadata, DUP: total=6.50GiB, used=4.81GiB GlobalReserve, single: total=512.00MiB, used=0.00B It's almost like the file system have decided that it needs to make a snapshot and store two complete copies of the complete file, which is obviously not going to work with a file larger than 50% of the file system. 
I think I _might_ understand what's going on here. Is that test program calling fallocate using the desired total size of the file, or just trying to allocate the range beyond the end to extend the file? I've seen issues with the first case on BTRFS before, and I'm starting to think that it might actually be trying to allocate the exact amount of space requested by fallocate, even if part of the range is already allocated space. No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption. /Per W On Tue, 1 Aug 2017, Hugo Mills wrote: Hi, Per, Start here: https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29 In your case, I'd suggest using "-dusage=20" to start with, as it'll probably free up quite a lot of your existing allocation. And this may also be of interest, in how to read the output of the tools: https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools Finally, I note that you've still got some "single" chunks present for metadata. It won't affect your space allocation issues, but I would recommend getting rid of them anyway: # btrfs balance start -mconvert=dup,soft Hugo. On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote: I have a 10TB file system with a parity file for a snapraid. However, I can suddenly not extend the parity file despite the file system only being about 50% filled - I should have 5TB of unallocated space. When trying to extend the parity file, fallocate() just returns ENOSPC, i.e. that the disk is full. Machine was originally a Debian 8 (Jessie) but after I detected the issue and no btrfs tool did show any errors, I have updated to Debian 9 (Snatch) to get a newer kernel and newer btrfs tools. 
pwm@europium:/mnt$ btrfs --version btrfs-progs v4.7.3 pwm@europium:/mnt$ uname -a Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux pwm@europium:/mnt/snap_04$ ls -l total 4932703608 -rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content -rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp -rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity pwm@europium:/mnt/snap_04$ df . Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04 pwm@europium:/mnt/snap_04$ sudo btrfs fi show . Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 Total devices 1 FS bytes used 4.60TiB devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 Compare this with the second snapraid parity disk: pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/ Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567 Total devices 1 FS bytes used 4.69TiB devid1 size 9.09TiB used 4.70TiB path
Re: Massive loss of disk space
Thanks for the links and suggestions. I did try your suggestions, but they didn't solve the underlying problem.

pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
Dumping filters: flags 0x1, state 0x0, force is off
DATA (flags 0x2): balancing, usage=20
Done, had to relocate 4596 out of 9317 chunks

pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
Done, had to relocate 2 out of 4721 chunks

pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
Data, single: total=4.60TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=512.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1

So now device 1 usage is down from 9.09TiB to 4.61TiB. But when I try to fallocate() to grow the large parity file, it fails immediately. I wrote a little helper program that just exercises fallocate(), instead of having to run snapraid with lots of unknown additional actions being performed.

Original file size is 5050486226944 bytes
Trying to grow file to 5151751667712 bytes
Failed fallocate [No space left on device]

And the result afterwards shows 'used' has jumped back up to 9.09TiB.

root@europium:/mnt# btrfs fi show snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1

root@europium:/mnt# btrfs fi df /mnt/snap_04/
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=992.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

It's almost as if the file system has decided that it needs to make a snapshot and store two complete copies of the file, which is obviously not going to work for a file larger than 50% of the file system.
No issue at all to grow the parity file on the other parity disk. And that's why I wonder if there is some undetected file system corruption.

/Per W

On Tue, 1 Aug 2017, Hugo Mills wrote:
[...]
Re: Btrfs + compression = slow performance and high cpu usage
> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As reported in a previous message, the output of 'filefrag -v' can be used to see what is going on:

filefrag /mnt/sde3/testfile
/mnt/sde3/testfile: 49287 extents found

Most of the latter extents are mercifully rather contiguous, their size is just limited by the compression code; here is an extract from 'filefrag -v' from around the middle:

24757: 1321888.. 1321919: 11339579.. 11339610: 32: 11339594:
24758: 1321920.. 1321951: 11339597.. 11339628: 32: 11339611:
24759: 1321952.. 1321983: 11339615.. 11339646: 32: 11339629:
24760: 1321984.. 1322015: 11339632.. 11339663: 32: 11339647:
24761: 1322016.. 1322047: 11339649.. 11339680: 32: 11339664:
24762: 1322048.. 1322079: 11339667.. 11339698: 32: 11339681:
24763: 1322080.. 1322111: 11339686.. 11339717: 32: 11339699:
24764: 1322112.. 1322143: 11339703.. 11339734: 32: 11339718:
24765: 1322144.. 1322175: 11339720.. 11339751: 32: 11339735:
24766: 1322176.. 1322207: 11339737.. 11339768: 32: 11339752:
24767: 1322208.. 1322239: 11339754.. 11339785: 32: 11339769:
24768: 1322240.. 1322271: 11339771.. 11339802: 32: 11339786:
24769: 1322272.. 1322303: 11339789.. 11339820: 32: 11339803:

But again this is on a fresh, empty Btrfs volume. As I wrote, "their size is just limited by the compression code", which results in "128KiB writes". On a "fresh empty Btrfs volume" the compressed extents limited to 128KiB also happen to be pretty physically contiguous, but on a more fragmented free space list they can be more scattered. As I already wrote, the main issue here seems to be that we are talking about a "RAID5 with 128KiB writes and a 768KiB stripe size".
On MD RAID5 the slowdown because of RMW seems only to be around 30-40%, but it looks like that several back-to-back 128KiB writes get merged by the Linux IO subsystem (not sure whether that's thoroughly legal), and perhaps they get merged by the 3ware firmware only if it has a persistent cache, and maybe your 3ware does not have one, but you have kept your counsel as to that. My impression is that you read the Btrfs documentation and my replies with a lot less attention than I write them. Some of the things you have done and said make me think that you did not read https://btrfs.wiki.kernel.org/index.php/Compression and 'man 5 btrfs', for example: "How does compression interact with direct IO or COW? Compression does not work with DIO, does work with COW and does not work for NOCOW files. If a file is opened in DIO mode, it will fall back to buffered IO. Are there speed penalties when doing random access to a compressed file? Yes. The compression processes ranges of a file of maximum size 128 KiB and compresses each 4 KiB (or page-sized) block separately." > I am currently defragmenting that mountpoint, ensuring that > everrything is compressed with zlib. Defragmenting the used space might help find more contiguous allocations. > p.s. any other suggestion that might help with the fragmentation > and data allocation. Should I try and rebalance the data on the > drive? Yes, regularly, as that defragments the unused space. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs incremental send | receive fails with Error: File not found
OK. The problem was that the original subvolume had a "Received UUID". This caused all subsequent snapshots to have the same Received UUID, which messes up Btrfs send | receive. Of course this means I must have used btrfs send | receive to create that subvolume and then turned it r/w at some point, though I cannot remember ever doing this. Perhaps a clear notice "WARNING: make sure that the source subvolume does not have a Received UUID" on the Wiki would be helpful? Both on https://btrfs.wiki.kernel.org/index.php/Incremental_Backup and on https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-property

Regards, A

On 7/28/2017 9:32 PM, Hermann Schwärzler wrote:

Hi, for me it looks like those snapshots are not read-only. But as far as I know, for using send they have to be. At least https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping states "We will need to create a read-only snapshot ..." I am using send/receive (with read-only snapshots) on a regular basis and never had a problem like yours. What are the commands you use to create your snapshots?

Greetings Hermann

On 07/28/2017 07:26 PM, A L wrote:

I often hit the following error when doing incremental btrfs send-receive: Btrfs incremental send | receive fails with Error: File not found. Sometimes I can do two or three incremental snapshots, but then the same error (with a different file) happens again. It seems that the files were changed or replaced between snapshots, which is causing the problems for send-receive. I have tried deleting all snapshots and starting over, but the problem comes back, so I think it must be a bug. The source volume is: /mnt/storagePool (with RAID1 profile) with subvolume: volume/userData Backup disk is: /media/usb-backup (external USB disk) [...]
Re: Slow mounting raid1
On Tue, Aug 1, 2017 at 2:43 AM, Leonidas Spyropoulos wrote:
> Hi Duncan,
>
> Thanks for your answer

In general I think btrfs takes time proportional to the size of your metadata to mount. Bigger and/or fragmented metadata leads to longer mount times. My big backup fs with >300GB of metadata takes over 20 minutes to mount, and that's with the free space tree, which is significantly faster than space cache v1.

>> If a device takes too long and times out you'll see resets and the like
>> in dmesg, but that normally starts at ~30 seconds, not the 5 seconds you
>> mention. Still, doesn't hurt to check.
Re: Massive loss of disk space
Hi, Per, Start here: https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29 In your case, I'd suggest using "-dusage=20" to start with, as it'll probably free up quite a lot of your existing allocation. And this may also be of interest, in how to read the output of the tools: https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools Finally, I note that you've still got some "single" chunks present for metadata. It won't affect your space allocation issues, but I would recommend getting rid of them anyway: # btrfs balance start -mconvert=dup,soft Hugo. On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote: > I have a 10TB file system with a parity file for a snapraid. > However, I can suddenly not extend the parity file despite the file > system only being about 50% filled - I should have 5TB of > unallocated space. When trying to extend the parity file, > fallocate() just returns ENOSPC, i.e. that the disk is full. > > Machine was originally a Debian 8 (Jessie) but after I detected the > issue and no btrfs tool did show any errors, I have updated to > Debian 9 (Snatch) to get a newer kernel and newer btrfs tools. > > pwm@europium:/mnt$ btrfs --version > btrfs-progs v4.7.3 > pwm@europium:/mnt$ uname -a > Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 > (2017-06-26) x86_64 GNU/Linux > > > > > pwm@europium:/mnt/snap_04$ ls -l > total 4932703608 > -rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content > -rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp > -rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity > > > > pwm@europium:/mnt/snap_04$ df . > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04 > > > > pwm@europium:/mnt/snap_04$ sudo btrfs fi show . 
> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 > Total devices 1 FS bytes used 4.60TiB > devid1 size 9.09TiB used 9.09TiB path /dev/sdg1 > > Compare this with the second snapraid parity disk: > pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/ > Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567 > Total devices 1 FS bytes used 4.69TiB > devid1 size 9.09TiB used 4.70TiB path /dev/sdi1 > > So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB. > While almost the same amount of file system usage. And almost > identical usage pattern. It's an archival RAID, so there is hardly > any writes to the parity files because there are almost no file > changes to the data files. The main usage is that the parity file > gets extended when one of the data disks reaches a new high water > mark. > > The only file that gets regularly rewritten is the snapraid.content > file that gets regenerated after every scrub. > > > > pwm@europium:/mnt/snap_04$ sudo btrfs fi df . > Data, single: total=9.08TiB, used=4.59TiB > System, DUP: total=8.00MiB, used=992.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, DUP: total=6.00GiB, used=4.81GiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du . > Total Exclusive Set shared Filename >4.59TiB 4.59TiB - ./snapraid.parity > 304.37MiB 304.37MiB - ./snapraid.content > 270.00MiB 270.00MiB - ./snapraid.content.tmp >4.59TiB 4.59TiB 0.00B . > > > > pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage . 
> Overall: > Device size: 9.09TiB > Device allocated: 9.09TiB > Device unallocated: 0.00B > Device missing: 0.00B > Used: 4.60TiB > Free (estimated): 4.49TiB (min: 4.49TiB) > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,single: Size:9.08TiB, Used:4.59TiB >/dev/sdg1 9.08TiB > > Metadata,single: Size:8.00MiB, Used:0.00B >/dev/sdg1 8.00MiB > > Metadata,DUP: Size:6.00GiB, Used:4.81GiB >/dev/sdg1 12.00GiB > > System,single: Size:4.00MiB, Used:0.00B >/dev/sdg1 4.00MiB > > System,DUP: Size:8.00MiB, Used:992.00KiB >/dev/sdg1 16.00MiB > > Unallocated: >/dev/sdg1 0.00B > > > > pwm@europium:~$ sudo btrfs check /dev/sdg1 > Checking filesystem on /dev/sdg1 > UUID: c46df8fa-03db-4b32-8beb-5521d9931a31 > checking extents > checking free space cache > checking fs roots > checking csums > checking root refs > found 5057294639104 bytes used err is 0 > total csum bytes: 4529856120 > total tree bytes: 5170151424 > total fs tree bytes: 178700288 > total extent tree bytes: 209616896 > btree space waste bytes: 182357204 > file data blocks
Massive loss of disk space
I have a 10TB file system with a parity file for a snapraid. However, I can suddenly not extend the parity file despite the file system only being about 50% filled - I should have 5TB of unallocated space. When trying to extend the parity file, fallocate() just returns ENOSPC, i.e. that the disk is full. The machine was originally a Debian 8 (Jessie), but after I detected the issue and no btrfs tool showed any errors, I have updated to Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.

pwm@europium:/mnt$ btrfs --version
btrfs-progs v4.7.3
pwm@europium:/mnt$ uname -a
Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux

pwm@europium:/mnt/snap_04$ ls -l
total 4932703608
-rw--- 1 root root 319148889 Jul 8 04:21 snapraid.content
-rw--- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
-rw--- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity

pwm@europium:/mnt/snap_04$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04

pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1

Compare this with the second snapraid parity disk:

pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
Total devices 1 FS bytes used 4.69TiB
devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1

So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB, while having almost the same amount of file system usage and an almost identical usage pattern. It's an archival RAID, so there are hardly any writes to the parity files because there are almost no file changes to the data files. The main usage is that the parity file gets extended when one of the data disks reaches a new high water mark.
The only file that gets regularly rewritten is the snapraid.content file that gets regenerated after every scrub. pwm@europium:/mnt/snap_04$ sudo btrfs fi df . Data, single: total=9.08TiB, used=4.59TiB System, DUP: total=8.00MiB, used=992.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=6.00GiB, used=4.81GiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=512.00MiB, used=0.00B pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du . Total Exclusive Set shared Filename 4.59TiB 4.59TiB - ./snapraid.parity 304.37MiB 304.37MiB - ./snapraid.content 270.00MiB 270.00MiB - ./snapraid.content.tmp 4.59TiB 4.59TiB 0.00B . pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage . Overall: Device size: 9.09TiB Device allocated: 9.09TiB Device unallocated: 0.00B Device missing: 0.00B Used: 4.60TiB Free (estimated): 4.49TiB (min: 4.49TiB) Data ratio: 1.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,single: Size:9.08TiB, Used:4.59TiB /dev/sdg1 9.08TiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/sdg1 8.00MiB Metadata,DUP: Size:6.00GiB, Used:4.81GiB /dev/sdg1 12.00GiB System,single: Size:4.00MiB, Used:0.00B /dev/sdg1 4.00MiB System,DUP: Size:8.00MiB, Used:992.00KiB /dev/sdg1 16.00MiB Unallocated: /dev/sdg1 0.00B pwm@europium:~$ sudo btrfs check /dev/sdg1 Checking filesystem on /dev/sdg1 UUID: c46df8fa-03db-4b32-8beb-5521d9931a31 checking extents checking free space cache checking fs roots checking csums checking root refs found 5057294639104 bytes used err is 0 total csum bytes: 4529856120 total tree bytes: 5170151424 total fs tree bytes: 178700288 total extent tree bytes: 209616896 btree space waste bytes: 182357204 file data blocks allocated: 5073330888704 referenced 5052040339456 pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/ scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31 scrub started at Mon Jul 31 21:26:50 2017 and finished after 06:53:47 total bytes scrubbed: 4.60TiB with 0 errors So where have my 
5TB of disk space gone? And what should I do to be able to get it back again? I could obviously reformat the partition and rebuild the parity since I still have one good parity, but that doesn't feel like a good route. It isn't impossible that this might happen again.

/Per W
RE: Btrfs + compression = slow performance and high cpu usage
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Konstantin V. Gavrilenko
> Sent: Tuesday, 1 August 2017 7:58 PM
> To: Peter Grandi
> Cc: Linux fs Btrfs
> Subject: Re: Btrfs + compression = slow performance and high cpu usage
>
> Peter, I don't think the filefrag is showing the correct fragmentation status of
> the file when the compression is used.
> At least the one that is installed by default in Ubuntu 16.04 - e2fsprogs |
> 1.42.13-1ubuntu1
>
> So for example, fragmentation of a compressed file is 320 times more than an
> uncompressed one.
>
> root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
> test5g-zeroes: 40903 extents found
>
> root@homenas:/mnt/storage/NEW# filefrag test5g-data
> test5g-data: 129 extents found

Compressed extents are about 128KB, uncompressed extents are about 128MB. (Can't remember the exact numbers.) I've had trouble with slow filesystems when using compression. The problem seems to go away when removing compression.

Paul.
[PATCH] btrfs: copy fsid to super_block s_uuid
We didn't copy the fsid to struct super_block.s_uuid, so overlayfs disables the index feature with btrfs as the lower FS:

kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.

Fix this by publishing the fsid through struct super_block.s_uuid.

Signed-off-by: Anand Jain
---
I tried to find out whether we had deliberately skipped this for some reason, but there is no information on that. If we mount a non-default subvol in the next mount/remount, it's still the same FS, so publishing the FSID instead of the subvol UUID is correct; I can't think of any other reason for not using s_uuid for btrfs.

 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 080e2ebb8aa0..b7e72d040442 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2899,6 +2899,7 @@ int open_ctree(struct super_block *sb,
 	sb->s_blocksize = sectorsize;
 	sb->s_blocksize_bits = blksize_bits(sectorsize);
+	memcpy(&sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);

 	mutex_lock(&fs_info->chunk_mutex);
 	ret = btrfs_read_sys_array(fs_info);
--
2.13.1
Re: Btrfs + compression = slow performance and high cpu usage
Peter, I don't think the filefrag is showing the correct fragmentation status of the file when the compression is used. At least the one that is installed by default in Ubuntu 16.04 - e2fsprogs | 1.42.13-1ubuntu1

So for example, fragmentation of a compressed file is 320 times more than an uncompressed one.

root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
test5g-zeroes: 40903 extents found

root@homenas:/mnt/storage/NEW# filefrag test5g-data
test5g-data: 129 extents found

I am currently defragmenting that mountpoint, ensuring that everything is compressed with zlib.

# btrfs fi defragment -rv -czlib /mnt/arh-backup

My guess is that it will take another 24-36 hours to complete, and then I will redo the test to see if that has helped. Will keep the list posted.

p.s. any other suggestions that might help with the fragmentation and data allocation? Should I try and rebalance the data on the drive?

kos

----- Original Message -----
From: "Peter Grandi" To: "Linux fs Btrfs"
Sent: Monday, 31 July, 2017 1:41:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u
> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives. [ ... ]
> Strip Size : 256 KB

So the full RMW data stripe length is 768KiB.

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but in my test the much higher CPU time of the latter was spread across many CPUs, while in your case it wasn't, even if the E5645 has 6 CPUs and can do 12 threads. That seemed to point to some high cost of finding free blocks, that is a very fragmented free list, or something else.
> dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress oflag=direct > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s The results with 'oflag=direct' are not relevant, because Btrfs behaves "differently" with that. > mountflags: > (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s > mountflags: > (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s That's pretty good for a RAID5 with 128KiB writes and a 768KiB stripe size, on a 3ware, and looks like that the hw host adapter does not have a persistent cache (battery backed usually). My guess that watching transfer rates and latencies with 'iostat -dk -zyx 1' did not happen. > mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s I had mentioned in my previous reply the output of 'filefrag'. That to me seems relevant here, because of RAID5 RMW and maximum extent size with Brfs compression and strip/stripe size. Perhaps redoing the tests with a 128KiB 'bs' *without* compression would be interesting, perhaps even with 'oflag=sync' instead of 'conv=fsync'. 
It is hard for me to see a speed issue here with Btrfs: for comparison I have done a simple test with both a 3+1 MD RAID5 set with a 256KiB chunk size and a single block device, on "contemporary" 1TB/2TB drives capable of sequential transfer rates of 150-190MB/s:

soft# grep -A2 sdb3 /proc/mdstat
md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

with compression:

soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5
soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile

soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k
13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k
13042144inputs+20482400outputs
Re: Slow mounting raid1
Hi Duncan, Thanks for your answer On 01/08/17, Duncan wrote: > > If you're doing any snapshotting, you almost certainly want noatime, not > the default relatime. Even without snapshotting and regardless of the > filesystem, tho on btrfs it's a bigger factor due to COW, noatime is a > recommended performance optimization. > > The biggest caveat with that is if you're running something that actually > depends on atime. Few if any modern applications depend on atime, with > mutt in some configurations being an older application that still does. > But AFAIK it only does in some configurations... The array has no snapshots and my mutt resides on a diff SSD btrfs so I can safely try this option. > > Is there anything suspect in dmesg during the mount? What does smartctl > say about the health of the devices? (smartctl -AH at least, the selftest > data is unlikely to be useful unless you actually run the selftests.) > dmesg while mount says: [19823.896790] BTRFS info (device sde): use lzo compression [19823.896798] BTRFS info (device sde): disk space caching is enabled [19823.896800] BTRFS info (device sde): has skinny extents Smartctl tests are scheduled to run all disks once every day (for short test) and every week for long tests. 
smartctl output:

# smartctl -AH /dev/sdd
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.11.12-1-ck] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005 143   143   054    Pre-fail Offline -           67
  3 Spin_Up_Time            0x0007 124   124   024    Pre-fail Always  -           185 (Average 185)
  4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           651
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 110   110   020    Pre-fail Offline -           36
  9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           4594
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           353
192 Power-Off_Retract_Count 0x0032 094   094   000    Old_age  Always  -           7671
193 Load_Cycle_Count        0x0012 094   094   000    Old_age  Always  -           7671
194 Temperature_Celsius     0x0002 162   162   000    Old_age  Always  -           37 (Min/Max 17/62)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           0

# smartctl -AH /dev/sde
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.11.12-1-ck] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005 142   142   054    Pre-fail Offline -           69
  3 Spin_Up_Time            0x0007 123   123   024    Pre-fail Always  -           186 (Average 187)
  4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           709
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 113   113   020    Pre-fail Offline -           35
  9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           4678
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           353
192 Power-Off_Retract_Count 0x0032 093   093   000    Old_age  Always  -           8407
193 Load_Cycle_Count        0x0012 093   093   000    Old_age
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
On Mon, Jul 31, 2017 at 03:00:53PM -0700, Justin Maggard wrote:
> Marc, do you have quotas enabled? IIRC, you're a send/receive user.
> The combination of quotas and btrfs receive can corrupt your
> filesystem, as shown by the xfstest I sent to the list a little while
> ago.

Thanks for checking. I do not use quotas, given the problems I had with them early on, over 2y ago.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS error: bad tree block start 0 623771648
On Sun, 30 Jul 2017 18:14:35 +0200 "marcel.cochem" wrote:
> I am pretty sure that not all data is lost, as I can grep through the
> 100 GB SSD partition. But my question is, if there is a tool to rescue
> all (intact) data and maybe have only a few corrupt files which can't
> be recovered.

There is such a tool, see https://btrfs.wiki.kernel.org/index.php/Restore

--
With respect,
Roman
Re: BTRFS error: bad tree block start 0 623771648
On Mon, 31 Jul 2017 11:12:01 -0700 Liu Bo wrote:
> Superblock and chunk tree root is OK, looks like the header part of
> the tree root is now all-zero, but I'm unable to think of a btrfs bug
> which can lead to that (if there is, it is a serious enough one)

I see that the FS is being mounted with "discard". So maybe it was a TRIM gone bad (wrong location or in a wrong sequence). Generally it appears to be not recommended to use "discard" by now (because of its performance impact, and maybe possible issues like this); instead, schedule a call to "fstrim" once a day or so, and/or on boot-up.

> on ssd like disks, by default there is only one copy for metadata.

Time and time again, the default of "single" metadata for SSD is a terrible idea. Most likely DUP metadata would have saved the FS in this case.

--
With respect,
Roman