On 08/07/2024 12:38, Udo Grabowski (IMK) wrote:
Hi,

we currently have a raid-z1 pool resilvering (two damaged devices
in different vdevs), but a third disk in one of the degraded vdevs
occasionally times out:

Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci
(scsi_vhci0):
Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184):
Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0

The problem: these hiccups cause the resilvering to RESTART! That
doesn't help to get the job done quickly, it just accelerates the
wear on the already unhealthy third disk, and it will eventually
spiral down to complete data loss from a two-disk failure in a
raid-z1 vdev.

Is there a way to switch off this behaviour via a kmdb-settable
parameter that can be changed on the running system (it's an older
illumos-cf25223258 from 2016)?
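For the archives: I'm not aware of a tunable in that vintage that disables
the restart itself, but the resilver pacing variables can at least be
inspected and adjusted live with mdb in kernel mode. A sketch - the variable
names below exist in 2016-era dsl_scan.c, but please verify them against
your own build before writing anything:

```shell
# Read the current value (as a decimal 32-bit int) of a resilver tunable:
echo "zfs_resilver_min_time_ms/D" | mdb -k

# Write it live (-w enables writes; the 0t prefix means decimal), e.g. to
# give resilver I/O more time per txg sync:
echo "zfs_resilver_min_time_ms/W 0t5000" | mdb -kw

# The same read/write pattern works for zfs_resilver_delay and zfs_scan_idle.
```

Writing kernel variables on a live system is obviously at-your-own-risk;
double-check the symbol exists (e.g. with ::nm) before the /W.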



Thanks to Toomas, Bronkoo, Keith, Marcel, and Bill for all the suggestions
for getting this pool rescued - I had to use all of the mentioned options!

Especially the ddrescue disk-clone tip from Keith was worth a ton of gold,
since it saved me at the very last moment from losing the whole pool.
I cloned the disk, checked the label for matching TXGs again directly
before the exchange, and the swap went smoothly; the label was updated
correctly.
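For reference, the clone-and-verify step looked roughly like this. A sketch
only - SRC and DST are hypothetical device paths, substitute your own:

```shell
# Hypothetical device paths - substitute your own controller/WWN paths.
SRC=/dev/rdsk/c5t5000CCA2441ED63Cd0
DST=/dev/rdsk/c5t5000CCA2441EAAAAd0

# First pass: grab everything readable, skip the slow scraping phase (-n),
# and keep a map file so the run can be resumed.
ddrescue -f -n $SRC $DST bad_disk.map

# Second pass: retry only the bad areas, up to three times.
ddrescue -f -r3 $SRC $DST bad_disk.map

# Before swapping the clone in, dump its four vdev labels and compare the
# txg values against the other disks in the vdev.
zdb -l ${DST}s0
```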

Indeed, after that I lost another second disk in one of the vdevs, and
this time that gave me 35k errors (interestingly, the pool didn't block),
but everything except one single file was recoverable! All in all, 4 disks
died in that process and 4 more are considerably broken, but luckily they
are spread across the vdevs, so there is no immediate problem if they
fail now.

I immediately started backup copy operations (despite the resilvering
pool), which are still ongoing, as there are 200 TB to save to a space I
first had to find, since no backup facility of that size is affordable
for us (the data was meant to be in a temporary state that would be
transformed and transferred to an archive later, but it stayed a bit too
long on that pool), so a bit of begging and convincing central operations
staff was necessary ...

As I can't 'zpool replace' the rest of the broken disks, since the
broken expander chip makes all the other disks spit errors like hell,
I would go down the ddrescue path again to clone and directly replace
those disks. That seems to be a recommendable practice in such cases,
as the rest of the vdev disks are not accessed, but pool writes should
be strictly suppressed during such an operation to keep the TXGs
consistent.
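To make the TXG check less error-prone, a tiny helper can compare the
highest label TXG of the clone against the source. A sketch, assuming the
`zdb -l` output of each disk has been saved to a file first; `label_txg`
is my own hypothetical helper name:

```shell
# Save the labels first, e.g.:
#   zdb -l /dev/rdsk/c1t0d0s0 > src.label
#   zdb -l /dev/rdsk/c1t1d0s0 > clone.label

# Extract the highest "txg:" value among the four labels in a zdb -l dump.
label_txg() {
    awk '$1 == "txg:" { if ($2 + 0 > max) max = $2 + 0 } END { print max + 0 }' "$1"
}

# Only swap the clone in when the TXGs agree:
# [ "$(label_txg src.label)" -eq "$(label_txg clone.label)" ] \
#     && echo "TXGs match, safe to swap" \
#     || echo "TXGs differ, re-clone before replacing"
```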

I also found that when a disk replacement goes wild very early, the best
thing is to immediately pull the replacement and put the disk to be
replaced back in. That saved me from losing everything, in another event.

Since the machine (an 8-year-old DataOn CIB-9470V2 dual-node storage
server) now also has failing 12V/5V power rails, it has to go now.
We are trying to get an all-flash dual-node one; the NVMe models are
now in a price region reachable for us, and would be MUCH faster.

So these are the essential lessons learned from this experience:

1.-10. BACKUP, if you can ...

11. Nothing is as permanent as temporary data ...

12. Check your disks' health regularly, with either smartmontools or a
    regular, but not too frequent, scrub. Disk problems often seem to
    pile up silently, causing havoc and disaster when a more serious,
    but otherwise recoverable, event arises.

13. In case a 2nd disk failure in a vdev is on the rise, clone the
    worst disk (or the one that is still accessible) before the fatal
    failure of both. smartmontools are your friend.

14. Do not export a failing pool. If you have to export, first try all
    measures to get it into an easily recoverable state. And do not
    switch off the machine in panic.

15. Stop a disk replacement immediately if that goes downhill early.

16. Be patient. Let ZFS do its thing. It does it well.

17. ZFS rocks, it's simply the safest place on earth for your data!
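The health-check lesson (12.) can be sketched as a crontab fragment.
Purely illustrative - device paths, the `-d scsi` type, and the pool
name "tank" are placeholders to adapt to your own setup:

```shell
# Hypothetical crontab entries - adjust paths, pool name, and mail handling.
# Weekly SMART health check on every disk (illumos rdsk paths, SAS disks):
# 0 3 * * 0  for d in /dev/rdsk/c*t*d0s0; do smartctl -H -d scsi "$d"; done
# Monthly scrub of the pool (here called "tank"):
# 0 4 1 * *  /usr/sbin/zpool scrub tank
```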

Thanks again for all that valuable help !
--
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026



------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/T2a32a4cc427e4845-M15c7a9886dfc05c808b981af