Btrfs reserve metadata problem
Hi All,

When testing Btrfs with fio 4k random write, I found that a volume with
less free space available has lower performance. It seems that the
smaller the free space of the volume, the smaller the amount of dirty
pages the filesystem can hold. There are only 6 MB of dirty pages when
the free space of the volume is only 10GB, with 16KB nodesize and cow
disabled.

Btrfs reserves metadata for every write. The amount to reserve is
calculated as follows: nodesize * BTRFS_MAX_LEVEL(8) * 2, i.e., it
reserves 256KB of metadata. The maximum amount of metadata reservation
depends on the size of metadata currently in use and the free space
within the volume (free chunk size / 16). When metadata reaches the
limit, btrfs needs to flush data to release the reservation.

1. Is there any logic behind the value (free chunk size / 16)?

	/*
	 * If we have dup, raid1 or raid10 then only half of the free
	 * space is actually useable.  For raid56, the space info used
	 * doesn't include the parity drive, so we don't have to
	 * change the math
	 */
	if (profile & (BTRFS_BLOCK_GROUP_DUP |
		       BTRFS_BLOCK_GROUP_RAID1 |
		       BTRFS_BLOCK_GROUP_RAID10))
		avail >>= 1;

	/*
	 * If we aren't flushing all things, let us overcommit up to
	 * 1/2th of the space. If we can flush, don't let us overcommit
	 * too much, let it overcommit up to 1/8 of the space.
	 */
	if (flush == BTRFS_RESERVE_FLUSH_ALL)
		avail >>= 3;
	else
		avail >>= 1;

2. Is there any way to improve this problem?

Thanks.
Robbie Ko
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs-progs: fi-usage: Fix wrong RAID10 used and unallocated space
[BUG]
For a very basic RAID10 with 4 disks, "fi usage" and "fi show" output
conflicting results:
------
# btrfs fi show /mnt/btrfs/
Label: none  uuid: 6d0229db-28d1-4696-ac20-e828cc45dc40
	Total devices 4 FS bytes used 1.12MiB
	devid    1 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk1
	devid    2 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk2
	devid    3 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk3
	devid    4 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk4

Here the unallocated space for disk4 should be a little less than 3GiB
(5.00GiB - 2.01GiB).

# btrfs fi usage /mnt/btrfs/
Overall:
    Device size:		  20.00GiB
    Device allocated:		   8.03GiB
    Device unallocated:		  11.97GiB
    Device missing:		     0.00B
    Used:			   2.25MiB
    Free (estimated):		   7.98GiB	(min: 7.98GiB)
    Data ratio:			      2.00
    Metadata ratio:		      2.00
    Global reserve:		  16.00MiB	(used: 0.00B)

Data,RAID10: Size:2.00GiB, Used:1.00MiB
   ...
   /dev/mapper/data-disk4	 512.00MiB

Metadata,RAID10: Size:2.00GiB, Used:112.00KiB
   ...
   /dev/mapper/data-disk4	 512.00MiB

System,RAID10: Size:16.00MiB, Used:16.00KiB
   ...
   /dev/mapper/data-disk4	   4.00MiB

Unallocated:
   ...
   /dev/mapper/data-disk4	   4.00GiB

While fi usage shows we still have 4.00GiB unallocated space.
------

[CAUSE]
calc_chunk_size() is used to convert chunk size to device extent size,
which is used to get the per-device data/meta/sys used space.

However, for RAID10 we just divide the chunk size by num_stripes,
without taking sub stripes into account, so the reported data/meta/sys
usage in RAID10 is halved.

[FIX]
Take the missing sub stripes into consideration.
Reported-by: Adam Bahe
Signed-off-by: Qu Wenruo
---
 cmds-fi-usage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 0b0e47fee194..0af4bdbc9370 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -664,7 +664,7 @@ static u64 calc_chunk_size(struct chunk_info *ci)
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID6)
 		return ci->size / (ci->num_stripes - 2);
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID10)
-		return ci->size / ci->num_stripes;
+		return ci->size / (ci->num_stripes / 2);
 
 	return ci->size;
 }
-- 
2.15.1
Re: A Big Thank You, and some Notes on Current Recovery Tools.
Stirling Westrup posted on Mon, 01 Jan 2018 14:44:43 -0500 as excerpted:

> In hind sight (which is always 20/20), I should have updated the backups
> before starting to make my changes, but as I'd just added a new 4T drive
> to the BTRFS RAID6 in my backup system a week before, and it went as
> smooth as butter, I guess I was feeling insufficiently paranoid.

Are you aware of btrfs raid56-mode history?

If you're running a current enough kernel (the wiki says 4.12 for raid56
mode, but you might want 4.14 for other fixes and/or the fact that it's
LTS), the severest known raid56 issues that had it
recommendation-blacklisted are fixed. But raid56 mode still doesn't have
a fix for the infamous parity-raid write hole, and parities are not
checksummed (in hindsight an implementation mistake, as it breaks btrfs'
otherwise solid integrity and checksumming guarantees); fixing that is
going to require an on-disk format change and some major work.

If you're running at least kernel 4.12 and are aware of and understand
the remaining raid56 caveats, raid56 mode can be a valid choice. If not,
I strongly recommend doing more research to learn and understand those
caveats before relying too heavily on that backup.

The most reliable and well tested btrfs multi-device mode remains raid1,
tho that's expensive in terms of space required since it duplicates
everything. For many devices, the recommendation seems to remain btrfs
raid1, either straight, or on top of a pair of mdraid0s (or the like:
dmraid0s, hardware raid0s, etc), since that performs better than btrfs
raid10, and removes a confusing tho not harmful (if properly understood)
layout ambiguity of btrfs raid10 as well.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman
Re: [PATCH v2 02/17] btrfs-progs: lowmem check: record returned errors after walk_down_tree_v2()
On 12/29/2017 07:17 PM, Nikolay Borisov wrote:
> On 20.12.2017 06:57, Su Yue wrote:
>> In lowmem mode with '--repair', check_chunks_and_extents_v2() will fix
>> accounting in block groups and clear the error bit BG_ACCOUNTING_ERROR.
>> However, the return value of check_btrfs_root() is either 0 or 1
>> instead of error bits. If the extent tree has an error, lowmem repair
>> always prints an error and returns a nonzero value even if the
>> filesystem is fine after repair.
>>
>> So let @err contain the bits returned by walk_down_tree_v2().
>> Introduce FATAL_ERROR for lowmem mode to represent negative return
>> values, since negative and positive values can't be mixed in bit
>> operations.
>>
>> Signed-off-by: Su Yue
>> ---
>>  cmds-check.c | 13 +++++++------
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index 309ac9553b3a..ebede26cef01 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -134,6 +134,7 @@ struct data_backref {
>>  #define DIR_INDEX_MISMATCH	(1<<19) /* INODE_INDEX found but not match */
>>  #define DIR_COUNT_AGAIN	(1<<20) /* DIR isize should be recalculated */
>>  #define BG_ACCOUNTING_ERROR	(1<<21) /* Block group accounting error */
>> +#define FATAL_ERROR		(1<<22) /* fatal bit for errno */
>>  
>>  static inline struct data_backref* to_data_backref(struct extent_backref *back)
>>  {
>> @@ -6556,7 +6557,7 @@ static struct data_backref *find_data_backref(struct extent_record *rec,
>>   * otherwise means check fs tree(s) items relationship and
>>   * @root MUST be a fs tree root.
>>   * Returns 0 represents OK.
>> - * Returns not 0 represents error.
>> + * Returns > 0 represents error bits.
>>   */
>
> What about the code in 'if (!check_all)' branch, check_fs_first_inode
> can return a negative value, hence check_btrfs_root can return a
> negative value. A negative value can also be returned from
> btrfs_search_slot. Clearly this patch needs to be thought out better

OK, I will update it.
Thanks for review.

>>  static int check_btrfs_root(struct btrfs_trans_handle *trans,
>>  			    struct btrfs_root *root, unsigned int ext_ref,
>> @@ -6607,12 +6608,12 @@ static int check_btrfs_root(struct btrfs_trans_handle *trans,
>>  	while (1) {
>>  		ret = walk_down_tree_v2(trans, root, , , ,
>>  					ext_ref, check_all);
>> -
>> -		err |= !!ret;
>> +		if (ret > 0)
>> +			err |= ret;
>>  		/* if ret is negative, walk shall stop */
>>  		if (ret < 0) {
>> -			ret = err;
>> +			ret = err | FATAL_ERROR;
>>  			break;
>>  		}
>> @@ -6636,12 +6637,12 @@ out:
>>   * @ext_ref: the EXTENDED_IREF feature
>>   *
>>   * Return 0 if no error found.
>> - * Return <0 for error.
>> + * Return not 0 for error.
>>   */
>>  static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
>>  {
>>  	reset_cached_block_groups(root->fs_info);
>> -	return check_btrfs_root(NULL, root, ext_ref, 0);
>> +	return !!check_btrfs_root(NULL, root, ext_ref, 0);
>>  }
>
> You make the function effectively boolean, make this explicit by
> changing its return value to bool. Also the name and the boolean return
> make the function REALLY confusing, i.e. when should we return true or
> false? As it stands it returns "false" on success and "true" otherwise,
> this is a mess...

> /*
Re: [PATCH v2 02/17] btrfs-progs: lowmem check: record returned errors after walk_down_tree_v2()
On 12/29/2017 07:17 PM, Nikolay Borisov wrote:
> On 20.12.2017 06:57, Su Yue wrote:
>> In lowmem mode with '--repair', check_chunks_and_extents_v2() will fix
>> accounting in block groups and clear the error bit BG_ACCOUNTING_ERROR.
>> However, the return value of check_btrfs_root() is either 0 or 1
>> instead of error bits. If the extent tree has an error, lowmem repair
>> always prints an error and returns a nonzero value even if the
>> filesystem is fine after repair.
>>
>> So let @err contain the bits returned by walk_down_tree_v2().
>> Introduce FATAL_ERROR for lowmem mode to represent negative return
>> values, since negative and positive values can't be mixed in bit
>> operations.
>>
>> Signed-off-by: Su Yue
>> ---
>>  cmds-check.c | 13 +++++++------
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index 309ac9553b3a..ebede26cef01 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -134,6 +134,7 @@ struct data_backref {
>>  #define DIR_INDEX_MISMATCH	(1<<19) /* INODE_INDEX found but not match */
>>  #define DIR_COUNT_AGAIN	(1<<20) /* DIR isize should be recalculated */
>>  #define BG_ACCOUNTING_ERROR	(1<<21) /* Block group accounting error */
>> +#define FATAL_ERROR		(1<<22) /* fatal bit for errno */
>>  
>>  static inline struct data_backref* to_data_backref(struct extent_backref *back)
>>  {
>> @@ -6556,7 +6557,7 @@ static struct data_backref *find_data_backref(struct extent_record *rec,
>>   * otherwise means check fs tree(s) items relationship and
>>   * @root MUST be a fs tree root.
>>   * Returns 0 represents OK.
>> - * Returns not 0 represents error.
>> + * Returns > 0 represents error bits.
>>   */
>
> What about the code in 'if (!check_all)' branch, check_fs_first_inode
> can return a negative value, hence check_btrfs_root can return a
> negative value. A negative value can also be returned from
> btrfs_search_slot. Clearly this patch needs to be thought out better
>
>>  static int check_btrfs_root(struct btrfs_trans_handle *trans,
>>  			    struct btrfs_root *root, unsigned int ext_ref,
>> @@ -6607,12 +6608,12 @@ static int check_btrfs_root(struct btrfs_trans_handle *trans,
>>  	while (1) {
>>  		ret = walk_down_tree_v2(trans, root, , , ,
>>  					ext_ref, check_all);
>> -
>> -		err |= !!ret;
>> +		if (ret > 0)
>> +			err |= ret;
>>  		/* if ret is negative, walk shall stop */
>>  		if (ret < 0) {
>> -			ret = err;
>> +			ret = err | FATAL_ERROR;
>>  			break;
>>  		}
>> @@ -6636,12 +6637,12 @@ out:
>>   * @ext_ref: the EXTENDED_IREF feature
>>   *
>>   * Return 0 if no error found.
>> - * Return <0 for error.
>> + * Return not 0 for error.
>>   */
>>  static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
>>  {
>>  	reset_cached_block_groups(root->fs_info);
>> -	return check_btrfs_root(NULL, root, ext_ref, 0);
>> +	return !!check_btrfs_root(NULL, root, ext_ref, 0);
>>  }
>
> You make the function effectively boolean, make this explicit by
> changing its return value to bool. Also the name and the boolean return
> make the function REALLY confusing, i.e. when should we return true or

In the past and present, check_fs_root_v2() always returns a boolean.
So the old annotation "Return <0 for error." is wrong.

Here check_btrfs_root() returns error bits instead of a boolean, so I
just make check_fs_root_v2() return a boolean explicitly.

> false? As it stands it returns "false" on success and "true" otherwise,
> this is a mess...

Although it returns 1 or 0, IMHO, letting it return an integer is good
enough.

Thanks,
Su

> /*
Re: A Big Thank You, and some Notes on Current Recovery Tools.
On 2018年01月02日 06:50, waxhead wrote:
> Qu Wenruo wrote:
>>
>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>>>
>>> Thanks to their tireless help in answering all my dumb questions I
>>> have managed to get my BTRFS working again! As I speak I have the
>>> full, non-degraded, quad of drives mounted and am updating my latest
>>> backup of their contents.
>>>
>>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>>> drives failed, and with help I was able to make a 100% recovery of the
>>> lost data. I do have some observations on what I went through though.
>>> Take this as constructive criticism, or as a point for discussing
>>> additions to the recovery tools:
>>>
>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>> errors exactly coincided with the 3 super-blocks on the drive.
>>
>> WTF, why does all this corruption happen at the btrfs super blocks?!
>>
>> What a coincidence.
>>
>>> The odds against this happening as random independent events is so
>>> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)
>>
>> Yep, that's also why I was thinking the corruption was much heavier
>> than we expected.
>>
>> But if this turns out to be superblocks only, then as long as the
>> superblock can be recovered, you're OK to go.
>>
>>> So, I'm going to guess this wasn't random chance. It's possible that
>>> something inside the drive's layers of firmware is to blame, but it
>>> seems more likely to me that there must be some BTRFS process that
>>> can, under some conditions, try to update all superblocks as quickly
>>> as possible.
>>
>> Btrfs only tries to update its superblocks when committing a
>> transaction, and it's only done after all devices are flushed.
>>
>> AFAIK there is nothing strange.
>>
>>> I think it must be that a drive failure during this
>>> window managed to corrupt all three superblocks.
>>
>> Maybe, but at least the first (primary) superblock is written with the
>> FUA flag. Unless you enabled libata FUA support (which is disabled by
>> default) AND your drive supports native FUA (not all HDDs support it;
>> I only have one Seagate 3.5" HDD that does), a FUA write will be
>> converted to write & flush, which should be quite safe.
>>
>> The only timing I can think of is between the superblock write request
>> submission and the wait for it.
>>
>> But anyway, btrfs superblocks are the ONLY metadata not protected by
>> CoW, so it is possible something may go wrong at a certain timing.
>>
>
> So from what I can piece together SSD mode is safer even for regular
> harddisks, correct?
>
> According to this...
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>
> - There is 3x superblocks for every device.

At most 3x. The 3rd one is for devices larger than 256G.

> - The superblocks are updated every 30 seconds if there is any changes...

The interval can be specified by the commit= mount option. And 30 is the
default.

> - SSD mode will not try to update all superblocks in one go, but update
> one by one every 30 seconds.

If I didn't miss anything, from write_dev_supers() and wait_dev_supers(),
nothing checks the SSD mount option flag to do anything different.

So, again if I didn't miss anything, superblock writes are the same,
unless you're using the nobarrier mount option.

Thanks,
Qu

> So if SSD mode is enabled even for harddisks then only 60 seconds of
> filesystem history / activity will potentially be lost... this sounds
> like a reasonable trade-off compared to having your entire filesystem
> hampered if your hardware is not perhaps optimal (which is sort of the
> point with BTRFS' checksumming anyway)
>
> So would it make sense to enable SSD behavior by default for HDDs?

>>> It may be better to
>>> perform an update-readback-compare on each superblock before moving
>>> onto the next, so as to avoid this particular failure in the future. I
>>> doubt this would slow things down much as the superblocks must be
>>> cached in memory anyway.
>>
>> That should be done by the block layer, where things like dm-integrity
>> could help.
>>
>>>
>>> 2) The recovery tools seem too dumb while thinking they are smarter
>>> than they are. There should be some way to tell the various tools to
>>> consider some subset of the drives in a system as worth considering.
>>
>> My fault, in fact there is a -F option for dump-super, to force it to
>> recognize the bad superblock and output whatever it has.
>>
>> In that case at least we could be able to see if it was really
>> corrupted or just some bitflip in the magic numbers.
>>
>>> Not knowing that a superblock was a single 4096-byte sector, I had
>>> primed my recovery by copying a valid superblock from one drive to the
>>> clone of my broken drive before starting the ddrescue of the failing
>>> drive. I had hoped that I could piece together a valid superblock from
>>> a
Re: A Big Thank You, and some Notes on Current Recovery Tools.
Qu Wenruo wrote:

On 2018年01月01日 08:48, Stirling Westrup wrote:

Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
YOU to Nikolay Borisov and most especially to Qu Wenruo!

Thanks to their tireless help in answering all my dumb questions I have
managed to get my BTRFS working again! As I speak I have the full,
non-degraded, quad of drives mounted and am updating my latest backup of
their contents.

I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T drives
failed, and with help I was able to make a 100% recovery of the lost
data. I do have some observations on what I went through though. Take
this as constructive criticism, or as a point for discussing additions
to the recovery tools:

1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
errors exactly coincided with the 3 super-blocks on the drive.

WTF, why does all this corruption happen at the btrfs super blocks?!

What a coincidence.

The odds against this happening as random independent events is so
unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)

Yep, that's also why I was thinking the corruption was much heavier than
we expected.

But if this turns out to be superblocks only, then as long as the
superblock can be recovered, you're OK to go.

So, I'm going to guess this wasn't random chance. It's possible that
something inside the drive's layers of firmware is to blame, but it
seems more likely to me that there must be some BTRFS process that can,
under some conditions, try to update all superblocks as quickly as
possible.

Btrfs only tries to update its superblocks when committing a
transaction, and it's only done after all devices are flushed.

AFAIK there is nothing strange.

I think it must be that a drive failure during this window managed to
corrupt all three superblocks.

Maybe, but at least the first (primary) superblock is written with the
FUA flag. Unless you enabled libata FUA support (which is disabled by
default) AND your drive supports native FUA (not all HDDs support it; I
only have one Seagate 3.5" HDD that does), a FUA write will be converted
to write & flush, which should be quite safe.

The only timing I can think of is between the superblock write request
submission and the wait for it.

But anyway, btrfs superblocks are the ONLY metadata not protected by
CoW, so it is possible something may go wrong at a certain timing.

So from what I can piece together SSD mode is safer even for regular
harddisks, correct?

According to this...
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

- There is 3x superblocks for every device.
- The superblocks are updated every 30 seconds if there is any changes...
- SSD mode will not try to update all superblocks in one go, but update
  one by one every 30 seconds.

So if SSD mode is enabled even for harddisks then only 60 seconds of
filesystem history / activity will potentially be lost... this sounds
like a reasonable trade-off compared to having your entire filesystem
hampered if your hardware is not perhaps optimal (which is sort of the
point with BTRFS' checksumming anyway)

So would it make sense to enable SSD behavior by default for HDDs?

It may be better to perform an update-readback-compare on each
superblock before moving onto the next, so as to avoid this particular
failure in the future. I doubt this would slow things down much as the
superblocks must be cached in memory anyway.

That should be done by the block layer, where things like dm-integrity
could help.

2) The recovery tools seem too dumb while thinking they are smarter than
they are. There should be some way to tell the various tools to consider
some subset of the drives in a system as worth considering.

My fault, in fact there is a -F option for dump-super, to force it to
recognize the bad superblock and output whatever it has.

In that case at least we could be able to see if it was really corrupted
or just some bitflip in the magic numbers.

Not knowing that a superblock was a single 4096-byte sector, I had
primed my recovery by copying a valid superblock from one drive to the
clone of my broken drive before starting the ddrescue of the failing
drive. I had hoped that I could piece together a valid superblock from a
good drive, and whatever I could recover from the failing one. In the
end this turned out to be a useful strategy, but meanwhile I had two
drives that both claimed to be drive 2 of 4, and no drive claiming to be
drive 1 of 4. The tools completely failed to deal with this case and
were consistently preferring to read the bogus drive 2 instead of the
real drive 2, and it wasn't until I deliberately patched over the magic
in the cloned drive that I could use the various recovery tools without
bizarre and spurious errors. I understand how this was never an
anticipated scenario for the recovery process, but if it's happened
once, it could happen again. Just dealing with a failing drive and its
clone both available in one
Re: A Big Thank You, and some Notes on Current Recovery Tools.
On Mon, Jan 1, 2018 at 7:15 AM, Kai Krakow wrote:
> Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:
>
>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>>
>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>> errors exactly coincided with the 3 super-blocks on the drive.
>>
>> WTF, why does all this corruption happen at the btrfs super blocks?!
>>
>> What a coincidence.
>
> Maybe it's a hybrid drive with flash? Or something went wrong in the
> drive-internal cache memory the very time the superblocks were updated?
>
> I bet that the sectors aren't really broken, just the on-disk checksum
> didn't match the sector. I remember such things happening to me more
> than once back in the days when drives were still connected by molex
> power connectors. Those connectors started to get loose over time, due
> to thermals or repeated disconnect and connect. That is, drives
> sometimes started to no longer have a reliable power source, which led
> to all sorts of very strange problems, mostly resulting in
> pseudo-defective sectors.
>
> That said, the OP would like to check the power supply after this
> coincidence... Maybe it's aging and no longer able to support all four
> drives, CPU, GPU and stuff with stable power.

You may be right about the cause of the error being a power-supply
issue. For those who are curious, the drive that failed was a Seagate
Barracuda LP 2000G drive (ST2000DL003).

I hadn't gone into the particulars of the failure, but the BTRFS in
question is my file server and it mostly holds ripped DVDs, so the
storage tends to grow in size but existing files seldom change, unless I
reorganize things. The intent is for it to be backed up to a proper
RAIDed BTRFS system weekly, but I have to admit that I've never gotten
around to automating the start of backups and have just been running it
whenever I make large changes to the file server, or reorganize things.

I was starting to run out of space on the file server, and I had noticed
a few transient drive errors in the logs (from the 2T device that
failed), so I had decided to add another 2T device to the array
temporarily, and then replace both the failing device and the temp
device with a new 4T drive once I'd had a chance to go buy one.

In hindsight (which is always 20/20), I should have updated the backups
before starting to make my changes, but as I'd just added a new 4T drive
to the BTRFS RAID6 in my backup system a week before, and it went as
smooth as butter, I guess I was feeling insufficiently paranoid.

I shut down the system, installed the 5th drive, rebooted... and
nothing. The system made some horrible sounds and refused to boot. It
wouldn't even get past POST. Not being a hardware guy I wasn't sure what
killed my server box, but I assume it was the power supply. Again, once
I get the chance I'll take it to my local computer shop and have someone
look at it.

Luckily I had an exactly identical system lying idle, so I swapped over
all the drives and the extra SATA controller to handle them, and booted
it up, only to find that the failing drive had now definitely failed.
Interestingly, the various tools I used kept reporting an 'unknown
error' for the 3 bad sectors. IIRC, one of the diagnostic tools reported
it as "Error 11 (Unknown)". In any case, there appeared to be many
errors on the disk, but when I used ddrescue to make a full copy of it,
all of the sectors were (eventually) fully recovered, except for the 3
superblocks.

After a few days of non-destructive tests and googling for information
on BTRFS multi-drive systems, I finally decided I had to contact this
list for advice, and the rest is well documented.
Re: A Big Thank You, and some Notes on Current Recovery Tools.
Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:

> On 2018年01月01日 08:48, Stirling Westrup wrote:
>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>>
>> Thanks to their tireless help in answering all my dumb questions I
>> have managed to get my BTRFS working again! As I speak I have the
>> full, non-degraded, quad of drives mounted and am updating my latest
>> backup of their contents.
>>
>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>> drives failed, and with help I was able to make a 100% recovery of the
>> lost data. I do have some observations on what I went through though.
>> Take this as constructive criticism, or as a point for discussing
>> additions to the recovery tools:
>>
>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>> errors exactly coincided with the 3 super-blocks on the drive.
>
> WTF, why does all this corruption happen at the btrfs super blocks?!
>
> What a coincidence.

Maybe it's a hybrid drive with flash? Or something went wrong in the
drive-internal cache memory the very time the superblocks were updated?

I bet that the sectors aren't really broken, just the on-disk checksum
didn't match the sector. I remember such things happening to me more
than once back in the days when drives were still connected by molex
power connectors. Those connectors started to get loose over time, due
to thermals or repeated disconnect and connect. That is, drives
sometimes started to no longer have a reliable power source, which led
to all sorts of very strange problems, mostly resulting in
pseudo-defective sectors.

That said, the OP would like to check the power supply after this
coincidence... Maybe it's aging and no longer able to support all four
drives, CPU, GPU and stuff with stable power.

--
Regards,
Kai
Replies to list-only preferred.
Re: A Big Thank You, and some Notes on Current Recovery Tools.
On 2018年01月01日 08:48, Stirling Westrup wrote:
> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>
> Thanks to their tireless help in answering all my dumb questions I
> have managed to get my BTRFS working again! As I speak I have the
> full, non-degraded, quad of drives mounted and am updating my latest
> backup of their contents.
>
> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
> drives failed, and with help I was able to make a 100% recovery of the
> lost data. I do have some observations on what I went through though.
> Take this as constructive criticism, or as a point for discussing
> additions to the recovery tools:
>
> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
> errors exactly coincided with the 3 super-blocks on the drive.

WTF, why does all this corruption happen at the btrfs super blocks?!

What a coincidence.

> The odds against this happening as random independent events is so
> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)

Yep, that's also why I was thinking the corruption was much heavier than
we expected.

But if this turns out to be superblocks only, then as long as the
superblock can be recovered, you're OK to go.

> So, I'm going to guess this wasn't random chance. It's possible that
> something inside the drive's layers of firmware is to blame, but it
> seems more likely to me that there must be some BTRFS process that
> can, under some conditions, try to update all superblocks as quickly
> as possible.

Btrfs only tries to update its superblocks when committing a
transaction, and it's only done after all devices are flushed.

AFAIK there is nothing strange.

> I think it must be that a drive failure during this
> window managed to corrupt all three superblocks.

Maybe, but at least the first (primary) superblock is written with the
FUA flag. Unless you enabled libata FUA support (which is disabled by
default) AND your drive supports native FUA (not all HDDs support it; I
only have one Seagate 3.5" HDD that does), a FUA write will be converted
to write & flush, which should be quite safe.

The only timing I can think of is between the superblock write request
submission and the wait for it.

But anyway, btrfs superblocks are the ONLY metadata not protected by
CoW, so it is possible something may go wrong at a certain timing.

> It may be better to
> perform an update-readback-compare on each superblock before moving
> onto the next, so as to avoid this particular failure in the future. I
> doubt this would slow things down much as the superblocks must be
> cached in memory anyway.

That should be done by the block layer, where things like dm-integrity
could help.

> 2) The recovery tools seem too dumb while thinking they are smarter
> than they are. There should be some way to tell the various tools to
> consider some subset of the drives in a system as worth considering.

My fault, in fact there is a -F option for dump-super, to force it to
recognize the bad superblock and output whatever it has.

In that case at least we could be able to see if it was really corrupted
or just some bitflip in the magic numbers.

> Not knowing that a superblock was a single 4096-byte sector, I had
> primed my recovery by copying a valid superblock from one drive to the
> clone of my broken drive before starting the ddrescue of the failing
> drive. I had hoped that I could piece together a valid superblock from
> a good drive, and whatever I could recover from the failing one. In
> the end this turned out to be a useful strategy, but meanwhile I had
> two drives that both claimed to be drive 2 of 4, and no drive claiming
> to be drive 1 of 4. The tools completely failed to deal with this case
> and were consistently preferring to read the bogus drive 2 instead of
> the real drive 2, and it wasn't until I deliberately patched over the
> magic in the cloned drive that I could use the various recovery tools
> without bizarre and spurious errors. I understand how this was never
> an anticipated scenario for the recovery process, but if it's happened
> once, it could happen again. Just dealing with a failing drive and its
> clone both available in one system could cause this.

Well, most tools put more focus on not screwing things up further, so
it's common that they're not as smart as users really want.

At least, super-recover could take more advantage of the chunk tree to
regenerate the super if the user really wants. (Although so far only
one case, and that's your case, could make use of this possible new
feature.)

> 3) There don't appear to be any tools designed for dumping a full
> superblock in hex notation, or for patching a superblock in place.
> Seeing as I was forced to use a hex editor to do exactly that, and
> then go through hoops to generate a correct CSUM for the patched
> block, I would certainly have preferred there to be some sort of
> utility to do the