You're welcome. ddrescue has saved the day for me enough times that I couldn't 
keep the idea to myself!

Scrubbing is also very underrated! Back in the early 2000s (before ddrescue was 
even a concept) I spent a long overnight session recovering from tape backups: 
a RAID 5 array with a hot spare had a disk error, and unfortunately one of the 
other data drives being used to rebuild onto the spare had a bad-sector problem 
of its own that surfaced towards the end of reconstruction, taking the whole 
array down hours into the rebuild. Regular scrubbing would have caught it.

My main ZFS box runs Samsung PM1643a enterprise SSDs with a weekly scrub and 
nightly zfs sends to a spinning-rust mirror as backup. Fingers crossed that's 
enough, short of doing off-site storage.
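
In case it's useful, the nightly job boils down to roughly this (pool, dataset, 
and snapshot names are just placeholders for my setup):

  # snapshot today's state, then send only the delta since yesterday's
  # snapshot to the backup pool on the spinning-rust mirror
  zfs snapshot tank/data@2024-08-02
  zfs send -i tank/data@2024-08-01 tank/data@2024-08-02 | \
      zfs recv -F backup/data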

I did have a recent scare though: after 444 days of uptime there must have been 
kernel memory corruption, or a leak, or something similar in the mpt_sas driver, 
which caused the pool to drop a device and go degraded...

Feb  4 04:12:12 box scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci15d9,691@0 (mpt_sas0):
Feb  4 04:12:12 box     unable to kmem_alloc enough memory for scatter/gather list
Feb  4 04:13:11 box fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Feb  4 04:13:11 box EVENT-TIME: Sun Feb  4 04:13:07 GMT 2024
Feb  4 04:13:11 box PLATFORM: VMware-Virtual-Platform, CSN: VMware-56-4d-d5-d0-8b-95-d8-73-3f-c1-70-75-09-e7-6d-76, HOSTNAME: box
Feb  4 04:13:11 box SOURCE: zfs-diagnosis, REV: 1.0
Feb  4 04:13:11 box EVENT-ID: f4fc2f70-e629-4650-97a2-d6e6335ea6d4
Feb  4 04:13:11 box DESC: The number of checksum errors associated with a ZFS device
Feb  4 04:13:11 box exceeded acceptable levels.  Refer to http://illumos.org/msg/ZFS-8000-GH for more information.
Feb  4 04:13:11 box AUTO-RESPONSE: The device has been marked as degraded.  An attempt
Feb  4 04:13:11 box will be made to activate a hot spare if available.
Feb  4 04:13:11 box IMPACT: Fault tolerance of the pool may be compromised.
Feb  4 04:13:11 box REC-ACTION: Run 'zpool status -x' and replace the bad device.
Feb  4 04:15:02 box nvidia_modeset: [ID 107833 kern.notice] Unloading

Ironically this was during a scrub!

I started thinking the worst, that the expensive SSDs were all about to fail at 
a similar age, but then I saw the kmem_alloc error and realised it wasn't 
hardware... though could the failed allocation have caused any knock-on damage?!

Thankfully a reboot and a 'zpool clear' were enough, and an immediate scrub 
completed OK.
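
For completeness, the recovery was roughly the following (pool name is a 
placeholder):

  zpool status -x   # after the reboot, confirm which device is faulted
  zpool clear tank  # reset the pool's error counters
  zpool scrub tank  # re-verify every block's checksum to be sure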

Keith

-----Original Message-----
From: Udo Grabowski (IMK) <udo.grabow...@kit.edu> 
Sent: 02 August 2024 13:53
To: discuss@lists.illumos.org
Subject: Re: [discuss] Disk resilvering problem

On 08/07/2024 12:38, Udo Grabowski (IMK) wrote:
> Hi,
>
> we currently have a raid-z1 pool resilvering (two damaged devices in 
> different vdevs), but a third disk in one of the degraded vdevs 
> occasionally times out:
>
>
> The problem: these hiccups cause the resilvering to RESTART! That 
> doesn't help get the job done quickly, it just accelerates the wear 
> on the already unhealthy third disk, and will eventually spiral down 
> to complete data loss from a 2-disk failure in a raidz1 vdev.

Thanks to Toomas, Bronkoo, Keith, Marcel, and Bill for all the suggestions for 
rescuing this pool - I had to use every one of the mentioned options!

Keith's ddrescue disk-clone tip in particular was worth its weight in gold, 
since it saved me at the very last moment from losing the whole pool.
I cloned the disk, checked the label for matching TXGs directly before the 
exchange, and the swap went smoothly - the label got updated correctly.
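
For the archives, the clone-and-check step looks roughly like this (device 
names are placeholders, and zdb -l is one way to inspect the labels and 
compare their TXGs):

  # copy everything readable from the failing disk onto the fresh one,
  # retrying bad areas a few times; the mapfile lets an interrupted
  # copy resume where it left off
  ddrescue -f -r3 /dev/rdsk/c1t0d0p0 /dev/rdsk/c1t9d0p0 rescue.map

  # dump the ZFS labels of both disks and check that the txg fields
  # match before physically swapping the clone in
  zdb -l /dev/rdsk/c1t0d0s0
  zdb -l /dev/rdsk/c1t9d0s0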

Indeed, after that I lost a second disk in a vdev, and this time it gave me 35k 
errors (interestingly, the pool didn't block), but everything except one single 
file was recoverable! All in all, 4 disks died in the process, and 4 more are 
considerably degraded, but luckily they are all spread across the vdevs, so 
there is no immediate problem if they fail now.

I started backup copies immediately (despite the pool still resilvering), and 
they are still ongoing: there are 200 TB to save, to space I first had to find, 
since we cannot afford a backup facility of that size (the data was meant to be 
temporary, to be transformed and transferred to an archive later, but it stayed 
a bit too long on that pool). So a bit of begging and convincing of central 
operations staff was necessary ...

As I can't 'zpool replace' the rest of the broken disks - the broken expander 
chip causes all the other disks to spit errors like hell - I will go down the 
ddrescue path again to clone and directly replace those disks. That seems to be 
a recommendable practice in such cases, since the rest of the vdev's disks are 
not touched, but pool writes should be strictly suppressed during such an 
operation to keep the TXGs consistent.
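
One way to keep user writes off the pool while such a clone runs (pool name is 
a placeholder; ZFS still writes some metadata of its own, so this reduces TXG 
churn rather than stopping it completely):

  zfs set readonly=on tank     # descendants inherit this and refuse writes
  # ... run ddrescue and swap the cloned disk in ...
  zfs inherit readonly tank    # back to the default, read-write again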

I also found that when a disk replacement goes wild very early, it's best to 
immediately pull the replacement and put the disk that was being replaced back 
in. That saved me from losing everything in another incident.

Since the machine (an 8-year-old DataOn CIB-9470V2 dual-node storage server) 
now also has failing 12V/5V power rails, it has to go. I'm trying to get an 
all-flash dual-node replacement; the NVMe ones are now in a price region that 
is reachable for us, and would be MUCH faster.

So these are the essential lessons learned from this experience:

1.-10. BACKUP, if you can ...

11. Nothing is as permanent as temporary data ...

12. Check your disks' health regularly, with either smartmontools or a
     regular (but not too frequent) scrub; a minimal smartmontools check
     is sketched after this list. Disk problems tend to pile up silently,
     causing havoc and disaster when a more serious, but otherwise
     recoverable, event arises.

13. If a second disk failure in a vdev is on the rise, clone the worst
     disk (or the one that is still accessible) before both disks fail
     fatally. smartmontools are your friend.

14. Do not export a failing pool. If you have to export it, first try
     every measure to get it into an easily recoverable state. And do
     not switch off the machine in a panic.

15. Stop a disk replacement immediately if it goes downhill early.

16. Be patient. Let ZFS do its thing. It does it well.

17. ZFS rocks, it's simply the safest place on earth for your data!
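
A minimal sketch of the smartmontools check from lesson 12 (device name is a 
placeholder):

  # overall health verdict plus the drive's own error log,
  # e.g. run weekly from cron with the output mailed somewhere visible
  smartctl -H -l error /dev/rdsk/c1t0d0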

Thanks again for all that valuable help!
--
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT 
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026

