I started to see r/w errors from one of my SAS drives after a routine zpool(8) scrub (dmesg is littered with ACK/NAK timeout errors). Since I have a pair of spares, I thought I'd replace the drive before investigating further (the HDD, controller, cables, and backplane are all old and suspect).
I wasn't sure whether in such a case one should take the problematic drive offline and resilver, or whether a simple replace would do, but I assumed zfs would be smart enough to do the Right Thing as it knows about the errors. So I issued

    # zpool replace pond wedges/slot4zfs wedges/slot7zfs

many hours ago. Since then, as I periodically check zpool(8) status, it appears that the various counters and timers keep starting over, while the error counts keep increasing. Most recently:

    # zpool status
      pool: pond
     state: ONLINE
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sat Aug 14 21:02:49 2021
            118G scanned out of 1.59T at 230M/s, 1h52m to go
            19.6G resilvered, 7.23% done
    config:

            NAME                   STATE     READ WRITE CKSUM
            pond                   ONLINE       0     0     0
              raidz2-0             ONLINE       0     0     0
                wedges/slot0zfs    ONLINE       0     0     0
                wedges/slot1zfs    ONLINE       0     0     0
                wedges/slot2zfs    ONLINE       0     0     0
                wedges/slot3zfs    ONLINE       0     0     0
                replacing-4        ONLINE       0     0   945
                  wedges/slot4zfs  ONLINE     299 5.07K     0  (resilvering)
                  wedges/slot7zfs  ONLINE       0     0     0  (resilvering)
                wedges/slot5zfs    ONLINE       0     0     0

    errors: No known data errors

This seems to show that the process most recently started at 21:02, but it has been going on since midday. The only difference I have noticed is that initially only the new device was being resilvered, but now both the old and the new appear to be.

I'm new to ZFS and this is the first time I'm dealing with disk errors, so I don't know whether this is normal behaviour and I should just wait, or whether I was wrong to issue replace rather than take the drive offline and resilver from the rest. If this is not normal, (how) can I recover?

Many thanks,
Pouya