On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> <[email protected]> wrote:
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now. It
> > would be nice to have a configurable number of retries before
> > failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block
> device that fails even a single write, and I'd expect the file
> system to quickly get confused if it can't rely on flushing pending
> writes to that device.

From my large-amounts-of-storage-admin PoV, I'd say it would be nice
to have more knobs to control when exactly a device is considered no
longer perfectly fine. That could include several different stages
(a rough sketch in code follows after the list):

- perhaps unreliable
  e.g. the device shows SMART problems, or there were correctable
  read and/or write errors below a certain threshold (either in
  total, or per time period).
  Then I could imagine that one can control whether such a device is:
  - continued to be used normally, until certain error thresholds
    are exceeded;
  - placed in a mode where data is still written to it, but only
    when there's a duplicate on at least one other good device, so
    the device would effectively be used as a read pool; maybe
    optionally, data already on the device is auto-replicated to
    good devices;
  - taken offline, perhaps only to be automatically reused in an
    emergency (as a hot spare), when the fs knows that otherwise
    it's even more likely that data would be lost soon.

- failed
  The threshold from above has been exceeded; the fs suspects the
  device will fail completely soon. Possible knobs here would
  control how aggressively data is moved off the device: How often
  should retries be made? If the other devices are under high IO
  load, what share of that bandwidth may be used to get the
  still-readable data off the bad device (up to 100%, meaning
  "rather stop any other IO, just move the data to good devices
  ASAP")?

- dead
  Accesses don't work at all anymore, and the fs shouldn't even
  waste time trying to read/recover data from it.
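Just to make that more concrete, here is a very rough sketch in
Python. To be clear: all names and thresholds below are made up by
me for illustration, nothing of this exists in btrfs; it only shows
the kind of state machine and knobs I have in mind:

    from dataclasses import dataclass, field
    from enum import Enum
    import time

    class DeviceState(Enum):
        FINE = "fine"              # no known problems
        UNRELIABLE = "unreliable"  # SMART hints / correctable errors
        FAILED = "failed"          # thresholds exceeded, expect failure
        DEAD = "dead"              # no access at all, don't even retry

    @dataclass
    class DevicePolicy:
        # Hypothetical knobs; the defaults are deliberately the most
        # conservative ones (a single error already demotes a device).
        write_retries: int = 0          # extra tries before a write fails
        max_errors_total: int = 0       # errors tolerated while FINE
        max_errors_per_hour: int = 0    # error rate tolerated while FINE
        evacuate_io_share: float = 1.0  # 1.0 = stop other IO, move ASAP

    @dataclass
    class DeviceHealth:
        error_times: list = field(default_factory=list)
        state: DeviceState = DeviceState.FINE

        def record_error(self, policy: DevicePolicy) -> DeviceState:
            """Note one read/write error and re-evaluate the state."""
            now = time.time()
            self.error_times.append(now)
            last_hour = [t for t in self.error_times if now - t < 3600]
            if self.state is DeviceState.FINE and (
                    len(self.error_times) > policy.max_errors_total
                    or len(last_hour) > policy.max_errors_per_hour):
                self.state = DeviceState.UNRELIABLE
            return self.state

    disk = DeviceHealth()
    disk.record_error(DevicePolicy())   # -> already UNRELIABLE

The three sub-options above (keep using normally / only duplicated
writes / offline as hot spare) would then be a per-state policy
attached to "unreliable", rather than separate states of their own.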
It would also make sense to allow tuning which conditions need to be
met to e.g. consider a drive unreliable (e.g. which SMART errors? A
toy check for that is sketched at the end of this mail), and to
allow an admin to manually place a drive in a certain state (e.g.
SMART is still good and there have been no IO errors so far, but the
drive is five years old and I'd rather consider it unreliable
already).

That's, to some extent, what we do at our LHC Tier-2 at higher
levels (partly simply by human management, partly via the storage
management system we use (dCache), partly by RAID and other tools
and scripting).

In any case, any of these knobs should IMHO default to the most
conservative settings. In other words: if a device shows even the
slightest hint of being unstable/unreliable/failed, it should be
considered bad and no new data should go on it (and if not enough
other devices are left, the fs should go ro).

The only thing I don't have a firm opinion on: should the fs go ro
and do nothing, waiting for a human to decide what's next, or should
it go ro and (if possible) try to move data off the bad device per
default?

Generally, a filesystem should be safe per default (which is why I
consider the issue in the other thread - the corruption/security
leaks in case of UUID collisions - quite a showstopper). From the
admin side, I shouldn't be required to make it safe; my interaction
should only be needed to tune things. Of course I'm aware that btrfs
brings several techniques which make it unavoidable that more
maintenance goes into the filesystem, but, per default, this should
be minimised as far as possible.
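As for the "which SMART errors?" question, here's a small,
self-contained Python sketch of the kind of check I mean (it assumes
smartctl from smartmontools is installed; a real policy would of
course look at far more attributes than this):

    import subprocess

    def drive_looks_unreliable(dev: str) -> bool:
        """Crude heuristic: treat a failed overall health check, or
        any reallocated sectors, as a hint of unreliability."""
        out = subprocess.run(["smartctl", "-H", "-A", dev],
                             capture_output=True, text=True).stdout
        if "PASSED" not in out:        # overall-health self-assessment
            return True
        for line in out.splitlines():
            if "Reallocated_Sector_Ct" in line:
                # the last column of smartctl -A output is RAW_VALUE
                if int(line.split()[-1]) > 0:
                    return True
        return False

    # e.g.: if drive_looks_unreliable("/dev/sdb"): demote the device

Cheers,
Chris.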