Btrfs reserve metadata problem

2018-01-01 Thread robbieko

Hi All,

When testing Btrfs with fio 4k random writes, I found that a volume with 
less free space available has lower performance.


It seems that the smaller the free space of the volume, the smaller the 
amount of dirty pages the filesystem can keep.
There are only 6 MB of dirty pages when the free space of the volume is 
only 10 GB, with a 16 KB nodesize and CoW disabled.


Btrfs reserves metadata for every write.
The amount to reserve is calculated as nodesize * BTRFS_MAX_LEVEL(8) * 2, 
i.e., it reserves 256 KB of metadata per write.
The maximum amount of metadata reservation depends on the size of the 
metadata currently in use and on the free space within the volume 
(free chunk size / 16).
When the reservation reaches that limit, btrfs has to flush data to 
release the reservation.
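
Spelling out that arithmetic as a minimal sketch (not the actual kernel
helper; the only real symbols here are nodesize and BTRFS_MAX_LEVEL, and
the comment about splits is my reading):

 #define BTRFS_MAX_LEVEL 8

 /* Bytes of metadata reserved for a single buffered write. */
 static unsigned long long reserve_per_write(unsigned int nodesize)
 {
         /*
          * Worst-case CoW of a full path in a tree of up to
          * BTRFS_MAX_LEVEL levels; the extra factor of 2 presumably
          * allows for node splits while inserting.
          */
         return (unsigned long long)nodesize * BTRFS_MAX_LEVEL * 2;
 }

 /* 16 KB nodesize: 16384 * 8 * 2 = 262144 bytes = 256 KB per write. */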


1. Is there any logic behind the value (free chunk size / 16)?

 /*
  * If we have dup, raid1 or raid10 then only half of the free
  * space is actually useable. For raid56, the space info used
  * doesn't include the parity drive, so we don't have to
  * change the math
  */
 if (profile & (BTRFS_BLOCK_GROUP_DUP |
                BTRFS_BLOCK_GROUP_RAID1 |
                BTRFS_BLOCK_GROUP_RAID10))
         avail >>= 1;

 /*
  * If we aren't flushing all things, let us overcommit up to
  * 1/2th of the space. If we can flush, don't let us overcommit
  * too much, let it overcommit up to 1/8 of the space.
  */
 if (flush == BTRFS_RESERVE_FLUSH_ALL)
         avail >>= 3;
 else
         avail >>= 1;
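
(For reference, the /16 above appears to just be the product of these two
shifts when the metadata profile is DUP or RAID1: avail / 2 / 8 =
avail / 16. A standalone sketch of that combination, where
overcommit_limit() is an illustrative name rather than a kernel function:)

 /* Rough overcommit limit derived from the quoted logic. */
 static unsigned long long overcommit_limit(unsigned long long free_chunk,
                                            int mirrored, int flush_all)
 {
         unsigned long long avail = free_chunk;

         if (mirrored)        /* DUP, RAID1, RAID10: half is usable */
                 avail >>= 1;

         if (flush_all)       /* BTRFS_RESERVE_FLUSH_ALL: 1/8       */
                 avail >>= 3;
         else                 /* otherwise: 1/2                     */
                 avail >>= 1;

         return avail;        /* mirrored + flush_all => free / 16  */
 }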

2. Is there any way to mitigate this problem?

Thanks.
Robbie Ko


[PATCH 1/2] btrfs-progs: fi-usage: Fix wrong RAID10 used and unallocated space

2018-01-01 Thread Qu Wenruo
[BUG]
For a very basic RAID10 with 4 disks, "fi usage" and "fi show" are
outputting conflicting results:
--
 # btrfs fi show /mnt/btrfs/
Label: none  uuid: 6d0229db-28d1-4696-ac20-e828cc45dc40
Total devices 4 FS bytes used 1.12MiB
devid    1 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk1
devid    2 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk2
devid    3 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk3
devid    4 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk4

Here the unallocated space for disk4 should be a little less than 3GiB
(5.00GiB - 2.01GiB).

 # btrfs fi usage  /mnt/btrfs/
Overall:
Device size:  20.00GiB
Device allocated:  8.03GiB
Device unallocated:   11.97GiB
Device missing:  0.00B
Used:  2.25MiB
Free (estimated):  7.98GiB  (min: 7.98GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID10: Size:2.00GiB, Used:1.00MiB
   ...
   /dev/mapper/data-disk4    512.00MiB

Metadata,RAID10: Size:2.00GiB, Used:112.00KiB
   ...
   /dev/mapper/data-disk4    512.00MiB

System,RAID10: Size:16.00MiB, Used:16.00KiB
   ...
   /dev/mapper/data-disk4  4.00MiB

Unallocated:
   ...
   /dev/mapper/data-disk4  4.00GiB

While fi usage shows we still have 4.00GiB unallocated space.
--

[CAUSE]
calc_chunk_size() is used to convert chunk size to device extent size,
which is used to get the per-device data/meta/sys used space.

However, for RAID10 we just divide the chunk size by num_stripes,
without taking sub stripes into account.

As a result, the reported data/meta/sys usage for RAID10 is halved.

[FIX]
Take the missing sub stripes into consideration.
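
As a worked example for the 4-disk setup above (illustration only, not
the btrfs-progs code path; the numbers come from the reports quoted in
the [BUG] section):

 #include <stdio.h>
 #include <stdint.h>

 int main(void)
 {
         uint64_t chunk_size  = 2ULL << 30;  /* 2.00GiB RAID10 data chunk */
         int      num_stripes = 4;           /* 2 copies x 2 sub stripes  */

         uint64_t old_per_dev = chunk_size / num_stripes;       /* 512MiB */
         uint64_t new_per_dev = chunk_size / (num_stripes / 2); /*   1GiB */

         printf("old: %llu MiB, new: %llu MiB\n",
                (unsigned long long)(old_per_dev >> 20),
                (unsigned long long)(new_per_dev >> 20));
         /*
          * With the fix, 1GiB data + 1GiB metadata + 8MiB system is about
          * 2.01GiB used per device, matching "fi show", so per-device
          * unallocated is ~3GiB rather than the bogus 4GiB.
          */
         return 0;
 }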

Reported-by: Adam Bahe 
Signed-off-by: Qu Wenruo 
---
 cmds-fi-usage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 0b0e47fee194..0af4bdbc9370 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -664,7 +664,7 @@ static u64 calc_chunk_size(struct chunk_info *ci)
else if (ci->type & BTRFS_BLOCK_GROUP_RAID6)
return ci->size / (ci->num_stripes -2);
else if (ci->type & BTRFS_BLOCK_GROUP_RAID10)
-   return ci->size / ci->num_stripes;
+   return ci->size / (ci->num_stripes / 2);
return ci->size;
 }
 
-- 
2.15.1



Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Duncan
Stirling Westrup posted on Mon, 01 Jan 2018 14:44:43 -0500 as excerpted:

> In hind sight (which is always 20/20), I should have updated the backups
> before starting to make my changes, but as I'd just added a new 4T drive
> to the BTRFS RAID6 in my backup system a week before, and it went as
> smooth as butter, I guess I was feeling insufficiently paranoid.

Are you aware of btrfs raid56-mode history?

If you're running a current enough kernel (the wiki says 4.12 for raid56 
mode, but you might want 4.14 for other fixes and/or the fact that it's 
LTS), the severest known raid56 issues that had it recommendation-
blacklisted are fixed.  But raid56 mode still doesn't have a fix for the 
infamous parity-raid write hole, and parities are not checksummed, in 
hindsight an implementation mistake as it breaks the integrity and 
checksumming guarantees btrfs otherwise provides; fixing that is going 
to require an on-disk format change and some major work.

If you're running at least kernel 4.12 and are aware of and understand 
the remaining raid56 caveats, raid56 mode can be a valid choice, but if 
not, I strongly recommend doing more research to learn and understand 
those caveats, before relying too heavily on that backup.

The most reliable and well-tested btrfs multi-device mode remains raid1, 
tho that's expensive in terms of space required, since it duplicates 
everything.  For many devices, the recommendation seems to remain btrfs 
raid1, either straight or on top of a pair of mdraid0s (or the like: 
dmraid0s, hardware raid0s, etc.), since that performs better than btrfs 
raid10, and removes a confusing, tho not harmful if properly understood, 
layout ambiguity of btrfs raid10 as well.
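
For completeness, a sketch of that btrfs-raid1-on-mdraid0 layering 
(device names are placeholders; double-check the mdadm and mkfs.btrfs 
man pages before copying anything):

 # mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
 # mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
 # mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1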

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v2 02/17] btrfs-progs: lowmem check: record returned errors after walk_down_tree_v2()

2018-01-01 Thread Su Yue



On 12/29/2017 07:17 PM, Nikolay Borisov wrote:



On 20.12.2017 06:57, Su Yue wrote:

In lowmem mode with '--repair', check_chunks_and_extents_v2()
will fix accounting in block groups and clear the error
bit BG_ACCOUNTING_ERROR.
However, the return value of check_btrfs_root() is 0 or 1 instead of
the error bits.

If the extent tree has errors, lowmem repair always prints an error and
returns a nonzero value even if the filesystem is fine after repair.

So let @err contain the error bits returned by walk_down_tree_v2().

Introduce FATAL_ERROR for lowmem mode to represent negative return
values, since negative and positive values cannot be mixed in bit operations.
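
(A minimal illustration of why the two can't be mixed, not code from the
patch; FATAL_ERROR is the bit introduced below:)

 	int err = 0;
 	int ret = -5;		/* e.g. -EIO from btrfs_search_slot()        */

 	err |= ret;		/* err becomes 0xfffffffb: the mask is ruined */

 	/* mapping negatives to one dedicated bit keeps the mask usable */
 	err = 0;
 	if (ret < 0)
 		err |= FATAL_ERROR;	/* (1<<22) as defined in the patch   */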

Signed-off-by: Su Yue 
---
  cmds-check.c | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 309ac9553b3a..ebede26cef01 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -134,6 +134,7 @@ struct data_backref {
  #define DIR_INDEX_MISMATCH  (1<<19) /* INODE_INDEX found but not match */
  #define DIR_COUNT_AGAIN (1<<20) /* DIR isize should be recalculated */
  #define BG_ACCOUNTING_ERROR (1<<21) /* Block group accounting error */
+#define FATAL_ERROR (1<<22) /* fatal bit for errno */
  
  static inline struct data_backref* to_data_backref(struct extent_backref *back)

  {
@@ -6556,7 +6557,7 @@ static struct data_backref *find_data_backref(struct 
extent_record *rec,
   *otherwise means check fs tree(s) items relationship and
   *  @root MUST be a fs tree root.
   * Returns 0  represents OK.
- * Returns not 0  represents error.
+ * Returns > 0  represents error bits.
   */


What about the code in 'if (!check_all)' branch, check_fs_first_inode
can return a negative value, hence check_btrfs_root can return a
negative value. A negative value can also be returned from
btrfs_search_slot.

Clearly this patch needs to be thought out better


OK, I will update it.
Thanks for review.

  static int check_btrfs_root(struct btrfs_trans_handle *trans,
struct btrfs_root *root, unsigned int ext_ref,
@@ -6607,12 +6608,12 @@ static int check_btrfs_root(struct btrfs_trans_handle 
*trans,
while (1) {
ret = walk_down_tree_v2(trans, root, , , ,
ext_ref, check_all);
-
-   err |= !!ret;
+   if (ret > 0)
+   err |= ret;
  
  		/* if ret is negative, walk shall stop */

if (ret < 0) {
-   ret = err;
+   ret = err | FATAL_ERROR;
break;
}
  
@@ -6636,12 +6637,12 @@ out:

   * @ext_ref:  the EXTENDED_IREF feature
   *
   * Return 0 if no error found.
- * Return <0 for error.
+ * Return not 0 for error.
   */
  static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
  {
reset_cached_block_groups(root->fs_info);
-   return check_btrfs_root(NULL, root, ext_ref, 0);
+   return !!check_btrfs_root(NULL, root, ext_ref, 0);
  }


You make the function effectively boolean; make this explicit by
changing its return value to bool. Also, the name and the boolean return
make the function REALLY confusing. I.e. when should we return true or
false? As it stands it returns "false" on success and "true" otherwise,
this is a mess...


  
  /*










Re: [PATCH v2 02/17] btrfs-progs: lowmem check: record returned errors after walk_down_tree_v2()

2018-01-01 Thread Su Yue



On 12/29/2017 07:17 PM, Nikolay Borisov wrote:



On 20.12.2017 06:57, Su Yue wrote:

In lowmem mode with '--repair', check_chunks_and_extents_v2()
will fix accounting in block groups and clear the error
bit BG_ACCOUNTING_ERROR.
However, the return value of check_btrfs_root() is 0 or 1 instead of
the error bits.

If the extent tree has errors, lowmem repair always prints an error and
returns a nonzero value even if the filesystem is fine after repair.

So let @err contain the error bits returned by walk_down_tree_v2().

Introduce FATAL_ERROR for lowmem mode to represent negative return
values, since negative and positive values cannot be mixed in bit operations.

Signed-off-by: Su Yue 
---
  cmds-check.c | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 309ac9553b3a..ebede26cef01 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -134,6 +134,7 @@ struct data_backref {
  #define DIR_INDEX_MISMATCH  (1<<19) /* INODE_INDEX found but not match */
  #define DIR_COUNT_AGAIN (1<<20) /* DIR isize should be recalculated */
  #define BG_ACCOUNTING_ERROR (1<<21) /* Block group accounting error */
+#define FATAL_ERROR (1<<22) /* fatal bit for errno */
  
  static inline struct data_backref* to_data_backref(struct extent_backref *back)

  {
@@ -6556,7 +6557,7 @@ static struct data_backref *find_data_backref(struct 
extent_record *rec,
   *otherwise means check fs tree(s) items relationship and
   *  @root MUST be a fs tree root.
   * Returns 0  represents OK.
- * Returns not 0  represents error.
+ * Returns > 0  represents error bits.
   */


What about the code in 'if (!check_all)' branch, check_fs_first_inode
can return a negative value, hence check_btrfs_root can return a
negative value. A negative value can also be returned from
btrfs_search_slot.

Clearly this patch needs to be thought out better


  static int check_btrfs_root(struct btrfs_trans_handle *trans,
struct btrfs_root *root, unsigned int ext_ref,
@@ -6607,12 +6608,12 @@ static int check_btrfs_root(struct btrfs_trans_handle 
*trans,
while (1) {
ret = walk_down_tree_v2(trans, root, , , ,
ext_ref, check_all);
-
-   err |= !!ret;
+   if (ret > 0)
+   err |= ret;
  
  		/* if ret is negative, walk shall stop */

if (ret < 0) {
-   ret = err;
+   ret = err | FATAL_ERROR;
break;
}
  
@@ -6636,12 +6637,12 @@ out:

   * @ext_ref:  the EXTENDED_IREF feature
   *
   * Return 0 if no error found.
- * Return <0 for error.
+ * Return not 0 for error.
   */
  static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
  {
reset_cached_block_groups(root->fs_info);
-   return check_btrfs_root(NULL, root, ext_ref, 0);
+   return !!check_btrfs_root(NULL, root, ext_ref, 0);
  }


You make the function effectively boolean, make this explicit by
changing its return value to bool. Also the name and the boolean return
makes the function REALLY confusing. I.e when should we return true or


In the past and present, check_fs_root_v2() always returns a boolean.
So the old annotation "Return <0 for error." is wrong.

Here check_btrfs_root() returns error bits instead of a boolean, so I
just make check_fs_root_v2() return a boolean explicitly.


false? As it stands it return "false" on success and "true" otherwise,
this is a mess...


Although it returns 1 or 0, IMHO letting it return an integer
is good enough.

Thanks,
Su



  
  /*










Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Qu Wenruo


On 2018年01月02日 06:50, waxhead wrote:
> Qu Wenruo wrote:
>>
>>
>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>>>
>>> Thanks to their tireless help in answering all my dumb questions I
>>> have managed to get my BTRFS working again! As I speak I have the
>>> full, non-degraded, quad of drives mounted and am updating my latest
>>> backup of their contents.
>>>
>>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>>> drives failed, and with help I was able to make a 100% recovery of the
>>> lost data. I do have some observations on what I went through though.
>>> Take this as constructive criticism, or as a point for discussing
>>> additions to the recovery tools:
>>>
>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>> errors exactly coincided with the 3 super-blocks on the drive.
>>
>> WTF, why all these corruption all happens at btrfs super blocks?!
>>
>> What a coincident.
>>
>>> The
>>> odds against this happening as random independent events is so
>>> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)
>>
>> Yep, that's also why I was thinking the corruption is much heavier than
>> our expectation.
>>
>> But if this turns out to be superblocks only, then as long as superblock
>> can be recovered, you're OK to go.
>>
>>> So, I'm going to guess this wasn't random chance. Its possible that
>>> something inside the drive's layers of firmware is to blame, but it
>>> seems more likely to me that there must be some BTRFS process that
>>> can, under some conditions, try to update all superblocks as quickly
>>> as possible.
>>
>> Btrfs only tries to update its superblock when committing transaction.
>> And it's only done after all devices are flushed.
>>
>> AFAIK there is nothing strange.
>>
>>> I think it must be that a drive failure during this
>>> window managed to corrupt all three superblocks.
>>
>> Maybe, but at least the first (primary) superblock is written with FUA
>> flag, unless you enabled libata FUA support (which is disabled by
>> default) AND your driver supports native FUA (not all HDD supports it, I
>> only have a seagate 3.5 HDD supports it), FUA write will be converted to
>> write & flush, which should be quite safe.
>>
>> The only timing I can think of is, between the superblock write request
>> submit and the wait for them.
>>
>> But anyway, btrfs superblocks are the ONLY metadata not protected by
>> CoW, so it is possible something may go wrong at certain timming.
>>
> 
> So from what I can piece together SSD mode is safer even for regular
> harddisks correct?
> 
> According to this...
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
> 
> - There is 3x superblocks for every device.

At most 3x. The 3rd one only exists on devices larger than 256GiB.
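
For reference, the copies sit at fixed offsets, which is why the 3rd one
only appears on big devices. A minimal sketch of the rule (the function
name is illustrative and the constants are from memory, so double-check
against the on-disk format wiki page):

 static unsigned long long sb_offset_sketch(int mirror)
 {
 	if (mirror == 0)
 		return 64ULL * 1024;		/* primary: 64KiB               */
 	return 16384ULL << (12 * mirror);	/* copy 1: 64MiB, copy 2: 256GiB */
 }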

> - The superblocks are updated every 30 seconds if there is any changes...

The interval can be specified with the commit= mount option,
and 30 seconds is the default.

> - SSD mode will not try to update all superblocks in one go, but update
> one by one every 30 seconds.

If I didn't miss anything, judging from write_dev_supers() and
wait_dev_supers(), nothing checks the SSD mount option flag to do
anything different.

So, again if I didn't miss anything, the superblock write path is the
same, unless you're using the nobarrier mount option.

Thanks,
Qu
> 
> So if SSD mode is enabled even for harddisks then only 60 seconds of
> filesystem history / activity will potentially be lost... this sounds
> like a reasonable trade-off compared to having your entire filesystem
> hampered if your hardware is not perhaps optimal (which is sort of the
> point with BTRFS' checksumming anyway)
> 
> So would it make sense to enable SSD behavior by default for HDD's ?!
> 
>>> It may be better to
>>> perform an update-readback-compare on each superblock before moving
>>> onto the next, so as to avoid this particular failure in the future. I
>>> doubt this would slow things down much as the superblocks must be
>>> cached in memory anyway.
>>
>> That should be done by block layer, where things like dm-integrity could
>> help.
>>
>>>
>>> 2) The recovery tools seem too dumb while thinking they are smarter
>>> than they are. There should be some way to tell the various tools to
>>> consider some subset of the drives in a system as worth considering.
>>
>> My fault, in fact there is a -F option for dump-super, to force it to
>> recognize the bad superblock and output whatever it has.
>>
>> In that case at least we could be able to see if it was really corrupted
>> or just some bitflip in magic numbers.
>>
>>> Not knowing that a superblock was a single 4096-byte sector, I had
>>> primed my recovery by copying a valid superblock from one drive to the
>>> clone of my broken drive before starting the ddrescue of the failing
>>> drive. I had hoped that I could piece together a valid superblock from
>>> a 

Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread waxhead

Qu Wenruo wrote:



On 2018年01月01日 08:48, Stirling Westrup wrote:

Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
YOU to Nikolay Borisov and most especially to Qu Wenruo!

Thanks to their tireless help in answering all my dumb questions I
have managed to get my BTRFS working again! As I speak I have the
full, non-degraded, quad of drives mounted and am updating my latest
backup of their contents.

I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
drives failed, and with help I was able to make a 100% recovery of the
lost data. I do have some observations on what I went through though.
Take this as constructive criticism, or as a point for discussing
additions to the recovery tools:

1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
errors exactly coincided with the 3 super-blocks on the drive.


WTF, why all these corruption all happens at btrfs super blocks?!

What a coincident.


The
odds against this happening as random independent events is so
unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)


Yep, that's also why I was thinking the corruption is much heavier than
our expectation.

But if this turns out to be superblocks only, then as long as superblock
can be recovered, you're OK to go.


So, I'm going to guess this wasn't random chance. Its possible that
something inside the drive's layers of firmware is to blame, but it
seems more likely to me that there must be some BTRFS process that
can, under some conditions, try to update all superblocks as quickly
as possible.


Btrfs only tries to update its superblock when committing transaction.
And it's only done after all devices are flushed.

AFAIK there is nothing strange.


I think it must be that a drive failure during this
window managed to corrupt all three superblocks.


Maybe, but at least the first (primary) superblock is written with FUA
flag, unless you enabled libata FUA support (which is disabled by
default) AND your driver supports native FUA (not all HDD supports it, I
only have a seagate 3.5 HDD supports it), FUA write will be converted to
write & flush, which should be quite safe.

The only timing I can think of is, between the superblock write request
submit and the wait for them.

But anyway, btrfs superblocks are the ONLY metadata not protected by
CoW, so it is possible something may go wrong at certain timming.



So from what I can piece together, SSD mode is safer even for regular 
hard disks, correct?


According to this...
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

- There are 3x superblocks for every device.
- The superblocks are updated every 30 seconds if there are any changes...
- SSD mode will not try to update all superblocks in one go, but updates 
them one by one every 30 seconds.


So if SSD mode is enabled even for hard disks, then only 60 seconds of 
filesystem history / activity will potentially be lost... this sounds 
like a reasonable trade-off compared to having your entire filesystem 
hampered if your hardware is perhaps not optimal (which is sort of the 
point with BTRFS' checksumming anyway).


So would it make sense to enable SSD behavior by default for HDDs?!


It may be better to
perform an update-readback-compare on each superblock before moving
onto the next, so as to avoid this particular failure in the future. I
doubt this would slow things down much as the superblocks must be
cached in memory anyway.


That should be done by block layer, where things like dm-integrity could
help.



2) The recovery tools seem too dumb while thinking they are smarter
than they are. There should be some way to tell the various tools to
consider some subset of the drives in a system as worth considering.


My fault, in fact there is a -F option for dump-super, to force it to
recognize the bad superblock and output whatever it has.

In that case at least we could be able to see if it was really corrupted
or just some bitflip in magic numbers.


Not knowing that a superblock was a single 4096-byte sector, I had
primed my recovery by copying a valid superblock from one drive to the
clone of my broken drive before starting the ddrescue of the failing
drive. I had hoped that I could piece together a valid superblock from
a good drive, and whatever I could recover from the failing one. In
the end this turned out to be a useful strategy, but meanwhile I had
two drives that both claimed to be drive 2 of 4, and no drive claiming
to be drive 1 of 4. The tools completely failed to deal with this case
and were consistently preferring to read the bogus drive 2 instead of
the real drive 2, and it wasn't until I deliberately patched over the
magic in the cloned drive that I could use the various recovery tools
without bizarre and spurious errors. I understand how this was never
an anticipated scenario for the recovery process, but if its happened
once, it could happen again. Just dealing with a failing drive and its
clone both available in one 

Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Stirling Westrup
On Mon, Jan 1, 2018 at 7:15 AM, Kai Krakow  wrote:
> Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:
>
>> On 2018年01月01日 08:48, Stirling Westrup wrote:
>>>
>>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>>> errors exactly coincided with the 3 super-blocks on the drive.
>>
>> WTF, why all these corruption all happens at btrfs super blocks?!
>>
>> What a coincident.
>
> Maybe it's a hybrid drive with flash? Or something that went wrong in the
> drive-internal cache memory the very time when superblocks where updated?
>
> I bet that the sectors aren't really broken, just the on-disk checksum
> didn't match the sector. I remember such things happening to me more than
> once back in the days when drives where still connected by molex power
> connectors. Those connectors started to get loose over time, due to
> thermals or repeated disconnect and connect. That is, drives sometimes
> started to no longer have a reliable power source which let to all sorts
> of very strange problems, mostly resulting in pseudo-defective sectors.
>
> That said, the OP would like to check the power supply after this
> coincidence... Maybe it's aging and no longer able to support all four
> drives, CPU, GPU and stuff with stable power.

You may be right about the cause of the error being a power-supply issue.
For those that are curious, the drive that failed was a Seagate Barracuda
LP 2000G drive (ST2000DL003).

I hadn't gone into the particulars of the failure, but the BTRFS in
question is my
file server and it mostly holds ripped DVDs, so the storage tends to
grow in size
but existing files seldom change, unless I reorganize things. The
intent is for it to
be backed up to a proper RAIDed BTRFS system weekly, but I have to admit that
I've never gotten around to automating the start of backups and have just been
running it whenever I make large changes to the file server, or
reorganize things.

I was starting to run out of space on the file server, and I had
noticed a few transient
drive errors in the logs (from the 2T device that failed) and so had
decided I'd add
another 2T device to the array temporarily, and then replace both the
failing device and the
temp device with a new 4T drive once I'd had a chance to go buy a new one.

In hindsight (which is always 20/20), I should have updated the
backups before starting to make my changes, but as I'd just added a new
4T drive to the BTRFS RAID6 in my backup system a week before, and it
went as smooth as butter, I guess I was feeling insufficiently
paranoid.

I shut down the system, installed the 5th drive, rebooted... and
nothing. The system made some
horrible sounds and refused to boot. It wouldn't even get past POST.
Not being a hardware
guy I wasn't sure what killed my server box, but I assume it was the
power supply. Again, once
I get the chance I'll take it to my local computer shop and have
someone look at it.

Luckily I had an exactly identical system lying idle, so I swapped
all the drives and the extra SATA controller to handle them, and booted
it up, only to find that the failing drive had now definitely failed.

Interestingly, the various tools I used kept reporting an 'unknown
error' for the 3 bad sectors. IIRC, one
of the diagnostic tools reported it as "Error 11 (Unknown)". In any
case, there appeared to be many
errors on the disk, but when I used ddrescue to make a full copy of
it, all of the sectors were (eventually)
fully recovered, except for the 3 superblocks.

After a few days of non-destructive tests and googling for information
on BTRFS multi-drive systems, I
finally decided I had to contact this list for advice, and the rest is
well documented.


Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Kai Krakow
Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:

> On 2018年01月01日 08:48, Stirling Westrup wrote:
>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>> 
>> Thanks to their tireless help in answering all my dumb questions I have
>> managed to get my BTRFS working again! As I speak I have the full,
>> non-degraded, quad of drives mounted and am updating my latest backup
>> of their contents.
>> 
>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>> drives failed, and with help I was able to make a 100% recovery of the
>> lost data. I do have some observations on what I went through though.
>> Take this as constructive criticism, or as a point for discussing
>> additions to the recovery tools:
>> 
>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>> errors exactly coincided with the 3 super-blocks on the drive.
> 
> WTF, why all these corruption all happens at btrfs super blocks?!
> 
> What a coincident.

Maybe it's a hybrid drive with flash? Or something went wrong in the 
drive-internal cache memory at the very time the superblocks were updated?

I bet that the sectors aren't really broken, just that the on-disk 
checksum didn't match the sector. I remember such things happening to me 
more than once back in the days when drives were still connected by Molex 
power connectors. Those connectors started to get loose over time, due to 
thermals or repeated disconnects and reconnects. That is, drives sometimes 
no longer had a reliable power source, which led to all sorts of very 
strange problems, mostly resulting in pseudo-defective sectors.

That said, the OP might want to check the power supply after this 
coincidence... Maybe it's aging and no longer able to supply all four 
drives, CPU, GPU and the rest with stable power.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Qu Wenruo


On 2018年01月01日 08:48, Stirling Westrup wrote:
> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
> YOU to Nikolay Borisov and most especially to Qu Wenruo!
> 
> Thanks to their tireless help in answering all my dumb questions I
> have managed to get my BTRFS working again! As I speak I have the
> full, non-degraded, quad of drives mounted and am updating my latest
> backup of their contents.
> 
> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
> drives failed, and with help I was able to make a 100% recovery of the
> lost data. I do have some observations on what I went through though.
> Take this as constructive criticism, or as a point for discussing
> additions to the recovery tools:
> 
> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
> errors exactly coincided with the 3 super-blocks on the drive.

WTF, why does all this corruption happen at the btrfs super blocks?!

What a coincidence.

> The
> odds against this happening as random independent events is so
> unlikely as to be mind-boggling. (Something like odds of 1 in 10^26)

Yep, that's also why I was thinking the corruption was much heavier than
we expected.

But if this turns out to be superblocks only, then as long as a
superblock can be recovered, you're OK to go.

> So, I'm going to guess this wasn't random chance. Its possible that
> something inside the drive's layers of firmware is to blame, but it
> seems more likely to me that there must be some BTRFS process that
> can, under some conditions, try to update all superblocks as quickly
> as possible.

Btrfs only tries to update its superblocks when committing a transaction.
And it's only done after all devices are flushed.

AFAIK there is nothing strange here.

> I think it must be that a drive failure during this
> window managed to corrupt all three superblocks.

Maybe, but at least the first (primary) superblock is written with the
FUA flag. Unless you have enabled libata FUA support (which is disabled
by default) AND your drive supports native FUA (not all HDDs support it;
I only have one Seagate 3.5" HDD that does), the FUA write will be
converted to a write & flush, which should be quite safe.

The only window I can think of is between submitting the superblock
write requests and waiting for them.

But anyway, btrfs superblocks are the ONLY metadata not protected by
CoW, so it is possible something may go wrong with certain timing.

> It may be better to
> perform an update-readback-compare on each superblock before moving
> onto the next, so as to avoid this particular failure in the future. I
> doubt this would slow things down much as the superblocks must be
> cached in memory anyway.

That should be done by the block layer, where things like dm-integrity
could help.

> 
> 2) The recovery tools seem too dumb while thinking they are smarter
> than they are. There should be some way to tell the various tools to
> consider some subset of the drives in a system as worth considering.

My fault; in fact there is a -F option for dump-super to force it to
recognize the bad superblock and output whatever it has.

In that case we would at least be able to see whether it was really
corrupted or just had a bit flip in the magic number.
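
A usage sketch (the -F force option is the one mentioned above; the -f
full-dump and -a all-copies flags are from memory, so check the
dump-super man page for your progs version, and /dev/sdX is a placeholder):

 # btrfs inspect-internal dump-super -f -F -a /dev/sdX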

> Not knowing that a superblock was a single 4096-byte sector, I had
> primed my recovery by copying a valid superblock from one drive to the
> clone of my broken drive before starting the ddrescue of the failing
> drive. I had hoped that I could piece together a valid superblock from
> a good drive, and whatever I could recover from the failing one. In
> the end this turned out to be a useful strategy, but meanwhile I had
> two drives that both claimed to be drive 2 of 4, and no drive claiming
> to be drive 1 of 4. The tools completely failed to deal with this case
> and were consistently preferring to read the bogus drive 2 instead of
> the real drive 2, and it wasn't until I deliberately patched over the
> magic in the cloned drive that I could use the various recovery tools
> without bizarre and spurious errors. I understand how this was never
> an anticipated scenario for the recovery process, but if its happened
> once, it could happen again. Just dealing with a failing drive and its
> clone both available in one system could cause this.

Well, most tools focus more on not screwing things up further, so it's
common that they're not as smart as users really want.

At least, super-recover could take more advantage of the chunk tree to
regenerate the super if the user really wants.
(Although so far only one case, and that's your case, could make use of
this possible new feature.)

> 
> 3) There don't appear to be any tools designed for dumping a full
> superblock in hex notation, or for patching a superblock in place.
> Seeing as I was forced to use a hex editor to do exactly that, and
> then go through hoops to generate a correct CSUM for the patched
> block, I would certainly have preferred there to be some sort of
> utility to do the