On 7/9/18 23:39, Ken Merry wrote:
> Hi ZFS folks,
>
> We (Spectra Logic) have seen some odd behavior with resilvers in RAIDZ3 pools.
>
> The codebase in question is FreeBSD stable/11 from July 2017, at
> approximately FreeBSD SVN version 321310.
>
> We have customer systems with (sometimes) hundreds of SMR drives in RAIDZ3
> vdevs in a large pool. (A typical arrangement is a 23-drive RAIDZ3, and some
> customers will put everything in one giant pool made up of a number of
> 23-drive RAIDZ3 arrays.)
>
> The SMR drives in question have a bug that sometimes causes them to go off
> the SAS bus for up to two minutes. (They’re usually gone a lot less than
> that, up to 10 seconds.) Once they come back online, zfsd puts the drive
> back in the pool and makes it online.
>
> If a resilver is active on a different drive, once the drive that temporarily
> left comes back, the resilver apparently starts over from the beginning.
>
> This leads to resilvers that take forever to complete, especially on systems
> with high load.

Since resilvering is single-threaded, adding the drive back immediately
doesn't buy you any additional redundancy. Maybe it would make sense for
zfsd to delay reinserting the drive until the ongoing resilver is done?

--
Pawel Jakub Dawidek
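
To make the suggestion above concrete, here is a minimal sketch of the deferral policy as a toy Python model. It is not zfsd code and does not call any ZFS interface; the class name `DeferredOnliner`, its methods, and the device name `da7` are all hypothetical and exist only to illustrate the ordering of events.

```python
from collections import deque


class DeferredOnliner:
    """Toy model (hypothetical, not zfsd): hold returning drives back
    while a resilver is running, and online them once it finishes."""

    def __init__(self):
        self.resilver_active = False
        self.pending = deque()  # drives waiting to be brought online
        self.onlined = []       # drives brought online, in order

    def start_resilver(self):
        self.resilver_active = True

    def drive_returned(self, drive):
        # Onlining the drive now would restart the running resilver,
        # so queue it instead of acting immediately.
        if self.resilver_active:
            self.pending.append(drive)
        else:
            self.onlined.append(drive)

    def resilver_finished(self):
        # The resilver is done; release the queued drives in arrival order.
        self.resilver_active = False
        while self.pending:
            self.onlined.append(self.pending.popleft())


policy = DeferredOnliner()
policy.start_resilver()
policy.drive_returned("da7")      # hypothetical device name
deferred = list(policy.pending)   # drive is held back during the resilver
policy.resilver_finished()
released = list(policy.onlined)   # drive is onlined only afterwards
```

The trade-off in this sketch is that a returning drive stays out of the pool a little longer, in exchange for the running resilver not being restarted.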
