Same here; the resilver will be restarted. Large zpools (in terms of
number of vdevs) should be avoided, if at all possible.
Rafael
On 7/9/18 2:57 PM, Jason Matthews wrote:
I cannot recall a time when manipulating a pool during a scrub or resilver
did not restart the scrub/resilver operation.
J.
On Jul 9, 2018, at 2:39 PM, Ken Merry <[email protected]> wrote:
Hi ZFS folks,
We (Spectra Logic) have seen some odd behavior with resilvers in RAIDZ3 pools.
The codebase in question is FreeBSD stable/11 from July 2017, at approximately
FreeBSD SVN version 321310.
We have customer systems with (sometimes) hundreds of SMR drives in RAIDZ3
vdevs in a large pool. (A typical arrangement is a 23-drive RAIDZ3, and some
customers will put everything in one giant pool made up of a number of 23-drive
RAIDZ3 arrays.)
The SMR drives in question have a bug that sometimes causes them to drop off the
SAS bus for up to two minutes. (They’re usually gone for much less than that, on
the order of 10 seconds.) Once they come back online, zfsd puts the drive back in
the pool and brings it online.
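As far as we can tell, what zfsd does when the drive reappears is roughly
equivalent to onlining the device by hand; a minimal sketch, using a hypothetical
pool name "tank" and device da42:

    # da42 dropped off the bus and has come back; bring it back into the
    # pool (this is approximately what zfsd automates for us).
    zpool online tank da42

    # Then check the pool and the scan state.
    zpool status -v tank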
If a resilver is active on a different drive, once the drive that temporarily
left comes back, the resilver apparently starts over from the beginning.
This leads to resilvers that take forever to complete, especially on systems
with high load.
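The restart is visible in the scan line of zpool status: the "resilver in
progress since <timestamp>" start time changes and the percent-done figure drops
back. A quick way to watch for it, again assuming a hypothetical pool named
"tank":

    # The "scan:" section shows the resilver start time and progress;
    # if the resilver restarts, the timestamp and "done" figure reset.
    zpool status tank | grep -A 3 'scan:'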
Is this expected behavior?
It seems that only one scan can be active on a pool at any given time. Is that
correct? If so, is that true for the entire top-level pool, or just a given
redundancy group? (In this case, that would be the RAIDZ3 vdev.)
Is there anything we can do to make sure the resilvers complete in a reasonable
period of time or otherwise improve the behavior? (Short of putting in
different drives…I have already suggested that.)
Thanks,
Ken
—
Ken Merry
[email protected]