You're welcome. ddrescue has saved the day for me too many times not to share the concept!
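For anyone who hasn't tried it, the pattern I usually start from looks roughly like this (device names and the mapfile path are just placeholders, and the exact options depend on how sick the source disk is):

  # pass 1: copy everything that reads cleanly, skip the slow scraping phase
  ddrescue -f -n /dev/rdsk/c1t0d0p0 /dev/rdsk/c1t9d0p0 /var/tmp/clone.map

  # pass 2: go back for the bad areas with direct access and a few retries,
  # reusing the same mapfile so only the still-missing blocks are touched
  ddrescue -f -d -r3 /dev/rdsk/c1t0d0p0 /dev/rdsk/c1t9d0p0 /var/tmp/clone.map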
Scrubbing is also very underrated! Back in the early 2000s (before ddrescue was even a concept) I spent a long overnight recovery session restoring from tape backups: a RAID 5 array with a hot spare had a disk error, and one of the other data devices being used to rebuild onto the spare also developed a bad-sector problem towards the end of reconstruction, which took the whole thing down hours into the rebuild. Regular scrubbing would have caught it.

My main ZFS box runs Samsung PM1643a enterprise SSDs with a weekly scrub, plus nightly zfs sends to a spinning-rust mirror as backup; fingers crossed that's enough, short of doing off-site storage.

I did have a recent scare, though. I think that after 444 days of uptime there must have been some kernel memory corruption or a leak in the mpt_sas driver, which caused the pool to drop a device and go degraded:

Feb  4 04:12:12 box scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci15d9,691@0 (mpt_sas0):
Feb  4 04:12:12 box     unable to kmem_alloc enough memory for scatter/gather list
Feb  4 04:13:11 box fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Feb  4 04:13:11 box EVENT-TIME: Sun Feb  4 04:13:07 GMT 2024
Feb  4 04:13:11 box PLATFORM: VMware-Virtual-Platform, CSN: VMware-56-4d-d5-d0-8b-95-d8-73-3f-c1-70-75-09-e7-6d-76, HOSTNAME: box
Feb  4 04:13:11 box SOURCE: zfs-diagnosis, REV: 1.0
Feb  4 04:13:11 box EVENT-ID: f4fc2f70-e629-4650-97a2-d6e6335ea6d4
Feb  4 04:13:11 box DESC: The number of checksum errors associated with a ZFS device
Feb  4 04:13:11 box exceeded acceptable levels.  Refer to http://illumos.org/msg/ZFS-8000-GH for more information.
Feb  4 04:13:11 box AUTO-RESPONSE: The device has been marked as degraded. An attempt
Feb  4 04:13:11 box will be made to activate a hot spare if available.
Feb  4 04:13:11 box IMPACT: Fault tolerance of the pool may be compromised.
Feb  4 04:13:11 box REC-ACTION: Run 'zpool status -x' and replace the bad device.
Feb  4 04:15:02 box nvidia_modeset: [ID 107833 kern.notice] Unloading

Ironically, this happened during a scrub! I started thinking the worst, that the expensive SSDs were all about to fail at a similar age, but then I saw the kmem_alloc error and realised it wasn't hardware... though could it have caused a knock-on problem?! Thankfully a reboot and a 'zpool clear' were enough, and an immediate scrub completed OK.

Keith

-----Original Message-----
From: Udo Grabowski (IMK) <udo.grabow...@kit.edu>
Sent: 02 August 2024 13:53
To: discuss@lists.illumos.org
Subject: Re: [discuss] Disk resilvering problem

On 08/07/2024 12:38, Udo Grabowski (IMK) wrote:
> Hi,
>
> we currently have a raid-z1 pool resilvering (two damaged devices in
> different vdevs), but a third disk in one of the degraded vdevs
> occasionally times out:
>
> The problem: These hiccups cause the resilvering to RESTART! Which
> doesn't help to get the job done quickly, just accelerates the wear
> on the already unhealthy third disk, and will finally spiral down to
> complete data loss because of a 2-disk failure on a z1 vdev.

Thanks to Toomas, Bronkoo, Keith, Marcel, and Bill for all the suggestions to get this pool rescued - I had to use all of the mentioned options! Especially the ddrescue disk-clone tip from Keith was worth its weight in gold, since it saved me at the very last moment from losing the whole pool. I cloned the disk, checked the label directly before the exchange for matching TXGs, and the exchange went smoothly; the label got updated correctly.
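For anyone wanting to repeat that label check, zdb can dump the vdev labels (pool/vdev GUIDs and the label txg) straight from a device, so the clone can be compared against a healthy member of the same vdev before the swap. Roughly like this - the device paths are just placeholders, not the actual disks involved:

  # labels on the freshly written clone
  zdb -l /dev/rdsk/c1t9d0s0

  # labels on a healthy disk of the same vdev, to compare the txg and guid fields
  zdb -l /dev/rdsk/c1t2d0s0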
Indeed, after that I lost a second disk in a vdev again, and this time it gave me 35k errors (interestingly, the pool didn't block), but everything except one single file was recoverable! All in all, 4 disks died in that process, and 4 additional disks are considerably damaged, but luckily they are all spread across the vdevs, so there's no immediate problem if they fail now.

I started backup copy operations immediately (despite the resilvering pool), and they are still ongoing, as there are 200 TB to save to a space I first had to find, since no backup facility of that size is affordable for us (the data was meant to be in a temporary state that would be transformed and transferred to an archive later, but it stayed a bit too long on that pool), so a bit of begging and convincing of central operations staff was necessary ...

As I can't 'zpool replace' the rest of the broken disks, since the broken expander chip makes all the other disks spit errors like hell, I will go down the ddrescue path again to clone and directly replace those disks. That seems to be a recommendable practice in such cases, as the rest of the vdev's disks are not accessed, but pool writes should be strictly suppressed during such an operation to keep the TXGs consistent.

I also found that when a disk replacement goes wild very early, the best thing is to immediately pull the replacement and put the disk to be replaced back in. That saved me from losing everything, in another event.

Since the machine (an 8-year-old DataOn CIB-9470V2 dual-node storage server) now also has failing 12V/5V power rails, it has to go. I'm trying to get an all-flash dual-node one; the NVMe models are now in a price region reachable for us, and would be MUCH faster.

So these are the essential lessons learned from this experience:

 1.-10. BACKUP, if you can ...
 11. Nothing is as permanent as temporary data ...
 12. Check your disks' health regularly, with either smartmontools or a regular, but not too frequently applied, scrub. Disk problems often seem to pile up silently, causing havoc and disaster when a more serious, but otherwise recoverable, event arises.
 13. If a second disk failure in a vdev is on the rise, clone the worst disk (or the one that is still accessible) before both disks fail fatally. smartmontools are your friend.
 14. Do not export a failing pool. If you have to export, try all measures to get it into an easily recoverable state first. And do not switch off the machine in a panic.
 15. Stop a disk replacement immediately if it goes downhill early.
 16. Be patient. Let ZFS do its thing. It does it well.
 17. ZFS rocks, it's simply the safest place on earth for your data!

Thanks again for all that valuable help!

--
Dr. Udo Grabowski    Inst. of Meteorology & Climate Research IMK-ASF-SAT
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology    https://www.kit.edu
Postfach 3640, 76021 Karlsruhe, Germany    T: (+49) 721 608-26026  F: -926026

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/T2a32a4cc427e4845-Ma1749cd11519219c9ba121e0
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription