Pouya Tafti <pouya+lists.net...@nohup.io> writes: [snip]
> # zpool replace pond wedges/slot4zfs wedges/slot7zfs
>
> many hours ago.  Since then, as I periodically check
> zpool(8) status it appears that the various counters and
> timers keep starting over, while the error rates keep
> increasing.  Most recently:
>
> # zpool status
>
>   pool: pond
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Sat Aug 14 21:02:49 2021
>         118G scanned out of 1.59T at 230M/s, 1h52m to go
>         19.6G resilvered, 7.23% done
> config:
>
>         NAME                   STATE     READ WRITE CKSUM
>         pond                   ONLINE       0     0     0
>           raidz2-0             ONLINE       0     0     0
>             wedges/slot0zfs    ONLINE       0     0     0
>             wedges/slot1zfs    ONLINE       0     0     0
>             wedges/slot2zfs    ONLINE       0     0     0
>             wedges/slot3zfs    ONLINE       0     0     0
>             replacing-4        ONLINE       0     0   945
>               wedges/slot4zfs  ONLINE     299 5.07K     0  (resilvering)
>               wedges/slot7zfs  ONLINE       0     0     0  (resilvering)
>             wedges/slot5zfs    ONLINE       0     0     0
>
> errors: No known data errors
>

[snip]

So... it looks like ZFS may have tried to resilver the failing drive
when you performed the replacement, or had already started to resilver
it at that point.  On another OS with ZFS I have seen similar
resilver-restarting behavior.  In my case it ultimately finished; I
just had to wait a while.

As a general rule, though not a hard requirement, it is a good idea to
take a ZFS member that has not fully failed offline before replacing
it.  If the member has failed completely, that is different, but if
there is any chance that it still functions, it is better to offline it
first and then replace it.  In this case, although I will admit I am
not completely sure, I think you can still offline the failing drive,
and the resilvering of the replacement might then proceed as you
expect.

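As a rough sketch of what I mean, assuming the pool and wedge names
from your status output: the helper names and the ZPOOL override below
are made up for illustration, and I have not tried this on a pool that
is in the middle of a replace.

```shell
#!/bin/sh
# Sketch only.  ZPOOL can be overridden for a dry run, e.g. ZPOOL=echo.
ZPOOL=${ZPOOL:-zpool}

# Take a member offline; while it is offline ZFS makes no attempt to
# read from or write to it.
# usage: offline_member POOL DEVICE
offline_member() {
    "$ZPOOL" offline "$1" "$2"
}

# Read "zpool status" text on stdin and print the resilver start time.
# If this value changes between checks, the resilver has started over.
resilver_start() {
    sed -n 's/^.*resilver in progress since //p'
}
```

You would run something like "offline_member pond wedges/slot4zfs"
once, and then "zpool status pond | resilver_start" every so often to
see whether the start time keeps changing.
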
I think what you may be seeing is that ZFS is trying to rebuild the
failing drive from the rest of the raid members while at the same time
trying to replace it.  The churn of doing both may be tripping the
restart, or you may be seeing interleaved output from one resilver and
then another.  I believe that more than one resilver is permitted to
run at a time.

-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org