On Mon, Jan 18, 2021 at 09:00:58PM -0300, Hérikz Nawarro wrote:
> Hello everyone,
> 
> I got an array of 4 disks with btrfs configured with data single and
> metadata dup

OK, that's weird.  Multiple disks should always have metadata in a raid1*
profile (raid1, raid10, raid1c3, or raid1c4).  dup metadata on multiple
disks, especially spinners, is going to be slow and brittle with no
upside.

> , one disk of this array was plugged with a bad sata cable
> that broke the plastic part of the data port (the pins still intact),
> i still can read the disk with an adapter, but there's a way to
> "isolate" this disk, recover all data and later replace the fault disk
> in the array with a new one?

There's no redundancy in this array, so you will have to keep the broken
disk online (or the filesystem unmounted) until a solution is implemented.

I wouldn't advise running with a broken connector at all, especially
without raid1 metadata.

Ideally, boot from rescue media, copy the broken device to a replacement
disk with dd, then remove the broken disk and mount the filesystem with
4 healthy disks.
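
A rough sketch of the dd approach follows.  The device paths are
placeholders invented for illustration, status=progress assumes GNU dd,
and the RUN wrapper is only there so the script prints the command
instead of running it until you have checked the names:

```shell
#!/bin/sh
# Sketch only.  BROKEN and NEW are hypothetical placeholder paths;
# substitute the real devices as seen from the rescue environment.
BROKEN=/dev/sdX   # disk with the damaged connector (placeholder)
NEW=/dev/sdY      # replacement disk, at least as large (placeholder)
RUN=${RUN-echo}   # prints the command by default; run with RUN= to execute

clone_disk() {
    # Raw copy of the whole device while the filesystem is unmounted.
    $RUN dd if="$BROKEN" of="$NEW" bs=1M status=progress
}

clone_disk
# Afterwards, detach the broken disk before mounting: both disks now
# carry the same btrfs UUID and devid, and must not be visible together.
```

With the default RUN=echo this only prints the dd command line, so it
is safe to run once as a check of the source and target names.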

If you try to operate with a broken connector, you could get disconnects
and lost writes.  With dup metadata there is no redundancy across
drives, so a lost metadata write on a single disk is a fatal error.
That will be a stress-test for btrfs's lost write detection, and even
if it works, it will force the filesystem read-only whenever it occurs
in a metadata write.  In the worst case, the disconnection resets the
drive and prevents its write cache from working properly, so a write is
lost in metadata, and the filesystem is unrecoverably damaged.
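
If the array has to stay mounted for a while anyway, one way to notice
disconnects early is to check the per-device error counters.  This is a
minimal sketch, assuming the filesystem is mounted at /array as in the
balance examples in this message; the RUN wrapper is an illustrative
dry-run switch, not part of btrfs:

```shell
#!/bin/sh
# Sketch only.  MNT is assumed to match the mount point used in the
# balance examples; adjust it for the real array.
MNT=/array
RUN=${RUN-echo}   # prints the command by default; run with RUN= to execute

watch_errors() {
    # Per-device counters: write_io_errs, read_io_errs, flush_io_errs,
    # corruption_errs, generation_errs.  Nonzero write or flush counts
    # on the broken disk are the warning sign.
    $RUN btrfs device stats "$MNT"
}

watch_errors
```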

There are other ways to do this, but they take longer, in some cases
orders of magnitude longer (and therefore carry higher risk):

1.  convert the metadata to raid1, starting with the faulty drive
(in these examples I'm just going to call it device 3; use the
correct device ID for your array):

        # Remove metadata from broken device first
        btrfs balance start -mdevid=3,convert=raid1,soft /array

        # Continue converting all other metadata in the array:
        btrfs balance start -mconvert=raid1,soft /array

After metadata is converted to raid1, an intermittent drive connection is
a much more recoverable problem, and you can replace the broken disk at
your leisure.  You'll get csum and IO errors when the drive disconnects,
but these errors will not be fatal to the filesystem as a whole because
the metadata will be safely written on other devices.

2.  convert the metadata to raid1 as in option 1, then delete the broken
device.  This is by far the slowest option, and only works if you have
sufficient space on the other drives for the data that has to move
off it.

3.  convert the metadata to raid1 as in option 1, add more disks so that
there is enough space for the device delete in option 2, then proceed
with the device delete in option 2.  This is probably worse than option
2 in terms of potential failure modes, but I put it here for completeness.

4.  when the replacement disk arrives, run 'btrfs replace' from the broken
disk to the new disk, then convert the metadata to raid1 as in option 1
so you're not using dup metadata any more.  This is as fast as the 'dd'
solution, but there is a slightly higher risk as the broken disk might
disconnect during a write and abort the replace operation.
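
The commands behind options 2 to 4 might look roughly like the sketch
below.  It reuses devid 3 and /array from the balance examples; the
replacement disk path and the function names are placeholders made up
for illustration, and RUN is a dry-run switch so nothing is executed
until the values have been checked:

```shell
#!/bin/sh
# Sketch only.  Placeholders (assumptions, not real values):
MNT=/array        # mount point, as in the balance examples
DEVID=3           # devid of the broken disk, as in the balance examples
NEW=/dev/sdY      # hypothetical path of the new disk
RUN=${RUN-echo}   # prints commands by default; run with RUN= to execute

show_layout() {
    # List devids and the current data/metadata profiles.
    $RUN btrfs filesystem show "$MNT"
    $RUN btrfs filesystem df "$MNT"
}

option2_delete() {
    # Option 2: with metadata already raid1, migrate everything off
    # the broken disk onto the remaining ones.
    $RUN btrfs device remove "$DEVID" "$MNT"
}

option3_add_then_delete() {
    # Option 3: add capacity first, then remove the broken disk.
    $RUN btrfs device add "$NEW" "$MNT"
    $RUN btrfs device remove "$DEVID" "$MNT"
}

option4_replace() {
    # Option 4: copy the broken disk onto the new one in place.
    $RUN btrfs replace start "$DEVID" "$NEW" "$MNT"
    # Without -1 this keeps reporting until the replace finishes.
    $RUN btrfs replace status "$MNT"
    # Then stop using dup metadata, as in option 1.
    $RUN btrfs balance start -mconvert=raid1,soft "$MNT"
}

show_layout
```

Only one of the option functions would be called, after show_layout has
confirmed which devid belongs to the broken disk.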

> Cheers,
