> OK, that's weird. Multiple disks should always have metadata in a raid1*
> profile (raid1, raid10, raid1c3, or raid1c4). dup metadata on multiple
> disks, especially spinners, is going to be slow and brittle with no
> upside.
I didn't know about this.

> There are other ways to do this, but they take longer, in some cases
> orders of magnitude longer (and therefore higher risk):
>
> 1. convert the metadata to raid1, starting with the faulty drive
>    (in these examples I'm just going to call it device 3; use the
>    correct device ID for your array):
>
>        # Remove metadata from the broken device first:
>        btrfs balance start -mdevid=3,convert=raid1,soft /array
>
>        # Continue converting all other metadata in the array:
>        btrfs balance start -mconvert=raid1,soft /array
>
>    After metadata is converted to raid1, an intermittent drive connection
>    is a much more recoverable problem, and you can replace the broken disk
>    at your leisure. You'll get csum and IO errors when the drive
>    disconnects, but these errors will not be fatal to the filesystem as a
>    whole, because the metadata will be safely written on the other devices.
>
> 2. convert the metadata to raid1 as in option 1, then delete the missing
>    device. This is by far the slowest option, and it only works if you
>    have sufficient space on the other drives for the new data.
>
> 3. convert the metadata to raid1 as in option 1, add more disks so that
>    there is enough space for the device delete in option 2, then proceed
>    with the device delete in option 2. This is probably worse than option
>    2 in terms of potential failure modes, but I include it here for
>    completeness.
>
> 4. when the replacement disk arrives, run 'btrfs replace' from the broken
>    disk to the new disk, then convert the metadata to raid1 as in option 1
>    so you're no longer using dup metadata. This is as fast as the 'dd'
>    solution, but there is a slightly higher risk: the broken disk might
>    disconnect during a write and abort the replace operation.

Thanks for the options, I'll try them soon.

On Sat, Jan 23, 2021 at 14:27, Zygo Blaxell <[email protected]> wrote:
>
> On Mon, Jan 18, 2021 at 09:00:58PM -0300, Hérikz Nawarro wrote:
> > Hello everyone,
> >
> > I got an array of 4 disks with btrfs, configured with data single and
> > metadata dup,
>
> [...]
>
> > one disk of this array was plugged in with a bad SATA cable that broke
> > the plastic part of the data port (the pins are still intact). I can
> > still read the disk with an adapter, but is there a way to "isolate"
> > this disk, recover all the data, and later replace the faulty disk in
> > the array with a new one?
>
> There's no redundancy in this array, so you will have to keep the broken
> disk online (or the filesystem unmounted) until a solution is implemented.
>
> I wouldn't advise running with a broken connector at all, especially
> without raid1 metadata.
>
> Ideally, boot from rescue media, copy the broken device to a replacement
> disk with dd, then remove the broken disk and mount the filesystem with
> 4 healthy disks.
>
> If you try to operate with a broken connector, you could get disconnects
> and lost writes. With dup metadata there is no redundancy across drives,
> so a lost metadata write on a single disk is a fatal error. That will be
> a stress test for btrfs's lost-write detection, and even if it works, it
> will force the filesystem read-only whenever it occurs in a metadata
> write. In the worst case, the disconnection resets the drive and
> prevents its write cache from working properly, so a write is lost in
> metadata, and the filesystem is unrecoverably damaged.
>
> [...]
>
> Cheers,
