On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> <[email protected]> wrote:
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now. It
> > would be nice to have a configurable number of retries before
> > failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block
> device that fails even a single write, and I'd expect the file
> system to quickly get confused if it can't rely on flushing pending
> writes to that device.

From my large-amounts-of-storage-admin PoV, I'd say it would be nice
to have more knobs to control when exactly a device is considered no
longer perfectly fine. That could include several different stages
(a rough sketch in code follows after the list):

- perhaps unreliable
  e.g. the device shows SMART problems, or there were correctable
  read and/or write errors below a certain threshold (either in
  total, or per time period).
  Then I could imagine that one can control whether such a device is:
  - continued to be used normally, until certain error thresholds
    are exceeded;
  - placed in a mode where data is still written to it, but only
    when there's a duplicate on at least one other good device, so
    the device would effectively be used as a read pool; maybe
    optionally, data already on the device is auto-replicated to
    good devices;
  - taken offline, perhaps only to be automatically reused in an
    emergency (as a hot spare), when the fs knows that otherwise
    it's even more likely that data would be lost soon.

- failed
  The threshold from above has been exceeded; the fs suspects the
  device will fail completely soon. Possible knobs here would
  control how aggressively data is moved off the device: How often
  should retries be made? If the other devices are under high IO
  load, what share of that bandwidth may be used to get the
  still-readable data off the bad device (up to 100%, meaning
  "rather stop any other IO, just move the data to good devices
  ASAP")?

- dead
  Accesses don't work at all anymore, and the fs shouldn't even
  waste time trying to read/recover data from it.
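Just to make that more concrete, here is a very rough sketch in
Python. To be clear: all names and thresholds below are made up by
me for illustration, nothing of this exists in btrfs; it only shows
the kind of state machine and knobs I have in mind:

    from dataclasses import dataclass, field
    from enum import Enum
    import time

    class DeviceState(Enum):
        FINE = "fine"              # no known problems
        UNRELIABLE = "unreliable"  # SMART hints / correctable errors
        FAILED = "failed"          # thresholds exceeded, expect failure
        DEAD = "dead"              # no access at all, don't even retry

    @dataclass
    class DevicePolicy:
        # Hypothetical knobs; the defaults are deliberately the most
        # conservative ones (a single error already demotes a device).
        write_retries: int = 0          # extra tries before a write fails
        max_errors_total: int = 0       # errors tolerated while FINE
        max_errors_per_hour: int = 0    # error rate tolerated while FINE
        evacuate_io_share: float = 1.0  # 1.0 = stop other IO, move ASAP

    @dataclass
    class DeviceHealth:
        error_times: list = field(default_factory=list)
        state: DeviceState = DeviceState.FINE

        def record_error(self, policy: DevicePolicy) -> DeviceState:
            """Note one read/write error and re-evaluate the state."""
            now = time.time()
            self.error_times.append(now)
            last_hour = [t for t in self.error_times if now - t < 3600]
            if self.state is DeviceState.FINE and (
                    len(self.error_times) > policy.max_errors_total
                    or len(last_hour) > policy.max_errors_per_hour):
                self.state = DeviceState.UNRELIABLE
            return self.state

    disk = DeviceHealth()
    disk.record_error(DevicePolicy())   # -> already UNRELIABLE

The three sub-options above (keep using normally / only duplicated
writes / offline as hot spare) would then be a per-state policy
attached to "unreliable", rather than separate states of their own.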
It would also make sense to allow tuning which conditions need to be
met to e.g. consider a drive unreliable (e.g. which SMART errors? A
toy check for that is sketched at the end of this mail), and to
allow an admin to manually place a drive in a certain state (e.g.
SMART is still good and there have been no IO errors so far, but the
drive is five years old and I'd rather consider it unreliable
already).

That's, to some extent, what we do at our LHC Tier-2 at higher
levels (partly simply by human management, partly via the storage
management system we use (dCache), partly by RAID and other tools
and scripting).

In any case, any of these knobs should IMHO default to the most
conservative settings. In other words: if a device shows even the
slightest hint of being unstable/unreliable/failed, it should be
considered bad and no new data should go on it (and if not enough
other devices are left, the fs should go ro).

The only thing I don't have a firm opinion on: should the fs go ro
and do nothing, waiting for a human to decide what's next, or should
it go ro and (if possible) try to move data off the bad device per
default?

Generally, a filesystem should be safe per default (which is why I
consider the issue in the other thread - the corruption/security
leaks in case of UUID collisions - quite a showstopper). From the
admin side, I shouldn't be required to make it safe; my interaction
should only be needed to tune things. Of course I'm aware that btrfs
brings several techniques which make it unavoidable that more
maintenance goes into the filesystem, but, per default, this should
be minimised as far as possible.
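As for the "which SMART errors?" question, here's a small,
self-contained Python sketch of the kind of check I mean (it assumes
smartctl from smartmontools is installed; a real policy would of
course look at far more attributes than this):

    import subprocess

    def drive_looks_unreliable(dev: str) -> bool:
        """Crude heuristic: treat a failed overall health check, or
        any reallocated sectors, as a hint of unreliability."""
        out = subprocess.run(["smartctl", "-H", "-A", dev],
                             capture_output=True, text=True).stdout
        if "PASSED" not in out:        # overall-health self-assessment
            return True
        for line in out.splitlines():
            if "Reallocated_Sector_Ct" in line:
                # the last column of smartctl -A output is RAW_VALUE
                if int(line.split()[-1]) > 0:
                    return True
        return False

    # e.g.: if drive_looks_unreliable("/dev/sdb"): demote the device

Cheers,
Chris.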