Hi ZFS folks,

We (Spectra Logic) have seen some odd behavior with resilvers in RAIDZ3 pools.
The codebase in question is FreeBSD stable/11 from July 2017, at approximately FreeBSD SVN revision r321310. We have customer systems with (sometimes) hundreds of SMR drives in RAIDZ3 vdevs in a large pool. (A typical arrangement is a 23-drive RAIDZ3 vdev, and some customers will put everything in one giant pool made up of a number of these 23-drive RAIDZ3 vdevs.)

The SMR drives in question have a bug that sometimes causes them to drop off the SAS bus for up to two minutes. (They're usually gone for a lot less than that, typically no more than 10 seconds.) Once a drive comes back, zfsd puts it back in the pool and onlines it. If a resilver is active on a different drive, the resilver apparently starts over from the beginning when the drive that temporarily dropped off returns. This leads to resilvers that take forever to complete, especially on systems under heavy load.

Is this expected behavior?

It seems that only one scan can be active on a pool at any given time. Is that correct? If so, is that true for the pool as a whole, or just for a given redundancy group? (In this case, that would be the RAIDZ3 vdev.)

Is there anything we can do to make sure the resilvers complete in a reasonable period of time, or otherwise improve the behavior? (Short of putting in different drives... I have already suggested that.)

Thanks,

Ken
--
Ken Merry
[email protected]
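
P.S. In case the layout is not obvious from the description above, here is a minimal sketch of the kind of pool these systems use: one pool built from several 23-drive RAIDZ3 vdevs. The pool name "tank" and the daN device numbers are made up for illustration; the real systems have far more drives:

    #!/bin/sh
    # Sketch only: one pool composed of three 23-drive RAIDZ3 vdevs.
    # Pool name and daN device numbers are hypothetical.
    vdevs=""
    d=0
    for v in 1 2 3; do                # three vdevs in this sketch
        vdevs="$vdevs raidz3"
        i=0
        while [ "$i" -lt 23 ]; do     # 23 drives per RAIDZ3 vdev
            vdevs="$vdevs da$d"
            d=$((d + 1))
            i=$((i + 1))
        done
    done
    zpool create tank $vdevs

And for anyone who wants to watch for the restart behavior, one simple way is to poll the scan line in zpool status; when the resilver starts over, the scanned byte count drops back toward zero. Something like this (pool name again hypothetical):

    #!/bin/sh
    # Poll resilver progress once a minute.
    while :; do
        date
        zpool status tank | grep -A 2 'scan:'
        sleep 60
    done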
