zfs resilver in(de)finite loop?

Pouya Tafti Sat, 14 Aug 2021 14:29:24 -0700

I started to see r/w errors from one of my SAS drives after
a routine zpool(8) scrub (dmesg is littered with ACK/NAK
timeout errors).  Since I have a pair of spares I thought
I'd replace the drive before investigating further (the
HDD, controller, cables, and backplane are all old and
suspect).


I wasn't sure whether in such a case one should take the
problematic drive offline and resilver, or whether a simple
replace would do, but I assumed ZFS would be smart enough to
do the Right Thing, as it knows about the errors.  So I issued

# zpool replace pond wedges/slot4zfs wedges/slot7zfs

many hours ago.  Since then, whenever I check zpool(8)
status, the various counters and timers appear to have
started over, while the error counts keep increasing.  Most
recently:

# zpool status

  pool: pond
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug 14 21:02:49 2021
        118G scanned out of 1.59T at 230M/s, 1h52m to go
        19.6G resilvered, 7.23% done
config:

        NAME                   STATE     READ WRITE CKSUM
        pond                   ONLINE       0     0     0
          raidz2-0             ONLINE       0     0     0
            wedges/slot0zfs    ONLINE       0     0     0
            wedges/slot1zfs    ONLINE       0     0     0
            wedges/slot2zfs    ONLINE       0     0     0
            wedges/slot3zfs    ONLINE       0     0     0
            replacing-4        ONLINE       0     0   945
              wedges/slot4zfs  ONLINE     299 5.07K     0  (resilvering)
              wedges/slot7zfs  ONLINE       0     0     0  (resilvering)
            wedges/slot5zfs    ONLINE       0     0     0

errors: No known data errors

which seems to show the process started most recently at
21:02, but this has been going on since midday.  The only
difference I have noticed is that initially only the new
device was being resilvered, but now both the old and the
new appear to be.
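
One way to tell a restarting resilver from a merely slow one
would be to log the start time and progress from the "scan:"
lines at intervals, rather than comparing them by eye.  The
following is only a rough sketch: resilver_summary is a
hypothetical helper name, and the parsing assumes the status
layout shown above, which may differ between ZFS versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints
# the resilver start time and the percentage done.
# Assumes the "resilver in progress since ..." and "... resilvered, N% done"
# lines look like the output quoted in this message.
resilver_summary() {
    awk '
        /resilver in progress since/ {
            # Keep only the timestamp that follows the fixed phrase.
            sub(/^.*resilver in progress since /, "")
            start = $0
        }
        /resilvered,.*% done/ {
            # The percentage is the field ending in "%".
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) pct = $i
        }
        END {
            if (start != "") printf "start=\"%s\" done=%s\n", start, pct
            else print "no resilver in progress"
        }
    '
}

# Example use (not run here): log one line every ten minutes.
#   while :; do
#       printf '%s  ' "$(date)"
#       zpool status pond | resilver_summary
#       sleep 600
#   done
```

If the logged start time changes between samples while the
percentage drops back towards zero, the resilver has restarted
rather than stalled.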

I'm new to ZFS and this is the first time I'm dealing with
disk errors.  So I don't know if this is normal behaviour
and I should just wait or if I was wrong to issue replace
rather than take the drive offline and resilver from the
rest.  If this is not normal, (how) can I recover?

Many thanks,
Pouya
