TL;DR: It worked! I'm back up and running, with what appears to be all
my data safely recovered from the failing storage stack!


On 2024-01-09 at 14:22, The Wanderer wrote:

> On 2024-01-09 at 14:01, Michael Kjörling wrote:
> 
>> On 9 Jan 2024 13:25 -0500, from wande...@fastmail.fm (The
>> Wanderer):

>>> I've ordered a 22TB external drive for the purpose of creating
>>> such a backup. Fingers crossed that things last long enough for
>>> it to get here and get the backup created.
>> 
>> I suggest selecting, installing and configuring (as much as 
>> possible) whatever software you will use to actually perform the 
>> backup while you wait for the drive to arrive. It might save you a 
>> little time later. Opinions differ but I like rsnapshot myself;
>> it's really just a front-end for rsync, so the copy is simply
>> files, making partial or full restoration easy without any special
>> tools.
> 
> My intention was to shut down everything that normally runs, log out
> as the user who normally runs it, log in as root (whose home
> directory, like the main installed system, is on a different RAID
> array with different backing drives), and use rsync from that point.
> My understanding is that in that arrangement, the only thing
> accessing the RAID-6 array should be the rsync process itself.
> 
> For additional clarity: the RAID-6 array is backing a pair of
> logical volumes, which are backing the /home and /opt partitions. The
> entire rest of the system is on a series of other logical volumes
> which are backed by a RAID-1 array, which is based on entirely
> different drives (different model, different form factor, different
> capacity, I think even different connection technology) and which has
> not seen any warnings arise.
> 
>>> dmesg does have what appears to be an error entry for each of
>>> the events reported in the alert mails, correlated with the
>>> devices in question. I can provide a sample of one of those, if
>>> desired.
>> 
>> As long as the drive is being honest about failures and is
>> reporting failures rapidly, the RAID array can do its work. What
>> you absolutely don't want to see is I/O errors relating to the RAID
>> array device (for example, with mdraid, /dev/md*), because that
>> would presumably mean that the redundancy was insufficient to
>> correct for the failure. If that happens, you are falling off a
>> proverbial cliff.
> 
> Yeah, *that* would be indicative of current catastrophic failure. I
> have not seen any messages related to the RAID array itself.

In the time since then, I continued mostly-normal but somewhat-curtailed
use of the system, and saw few messages about these matters other than
ones that arose from my attempts to back up the data for later recovery
purposes.

> (For awareness: this is all a source of considerable psychological 
> stress to me, to an extent that is leaving me on the edge of
> physically ill, and I am managing to remain on the good side of that
> line only by minimizing my mental engagement with the issue as much
> as possible. I am currently able to read and respond to these mails
> without pressing that line, but that may change at any moment, and if
> so I will stop replying without notice until things change again.)

This need to stop reading wound up happening almost immediately after I
sent the message to which I am replying.

I now, however, have good news to report back: after more than a month,
at least one change of plans, nearly $2200 in replacement drives,
much nervous stress, several days of running data copies to and from a
20+-terabyte mechanical hard drive over USB, and a complete manual
removal of my old 8-drive RAID-6 array and build of a new 6-drive RAID-6
array (and of the LVM structure on top of it), I now appear to have
complete success.

I am now running on a restored copy of the data on the affected
partitions, taken from a nearly-fully-shut-down system state, which is
sitting on a new RAID-6 array built on what I understand to be
data-center-class SSDs (which should, therefore, be more suitable to the
24/7-uptime read-mostly workload I expect of my storage). The current
filesystems involved are roughly the same size as the ones previously in
use, but the underlying drives are nearly 2x the size; I decided to
leave the extra capacity for later allocation via LVM, if and when I may
need it.
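
(For my own later reference, allocating some of that spare capacity to
one of the logical volumes should be roughly the following - the VG/LV
names here are placeholders rather than necessarily my actual ones:

    # add 200 GiB to the LV backing /home and grow its filesystem in one step
    lvextend -r -L +200G /dev/vg_home/home

The -r flag tells lvextend to resize the filesystem along with the
logical volume, so no separate filesystem-resize step should be needed.)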


I did my initial data backup to the external drive, from a
still-up-and-running system, via rsnapshot. Attempting to do a second
rsnapshot, however, failed at the 'cp -al' stage with "too many
hardlinks" errors. It turns out that there is a hard limit of 65000
hardlinks to any single on-disk file (inode); so many of the files on
the source filesystem were already hardlinked together that the 'cp -al'
stage, which tries to give each file one new name for every name it
already has, ran straight into that limit.

(The default rsnapshot configuration doesn't preserve hardlinks,
possibly in order to avoid this exact problem - but that isn't viable
for the case I had at hand, because in some cases I *need* to preserve
the hardlink status, and because without that deduplication there
wouldn't have been enough space on the drive for more than the single
copy, in which case there'd be very little point in using rsnapshot
rather than just rsync.)
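
(For anyone wanting to check how close a tree is to that ceiling before
relying on 'cp -al'-style rotation, something along these lines should
show the worst offenders - the path is a placeholder, not my actual
layout:

    # list files whose link count is already above 30000, i.e. files that
    # would blow past the ~65000-links-per-inode limit after one more doubling
    find /home/backups -type f -links +30000 -printf '%n %p\n' | sort -nr | head

GNU find's "-links +N" matches files with more than N hardlinks, and
'%n' in -printf prints the current link count.)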

In the end, after several flailing-around attempts to minimize or
mitigate that problem, I wound up moving the initial external copy of
the biggest hardlink-deduplicated tree out of the way (that tree is
essentially 100% read-only at this point; it's backup copies of an old
system state, preserved because one of those copies has corrupted data
and I haven't yet been able to confirm that all of the files in my
current copy of that data were taken from the non-corrupt version),
shutting down all parts of the system that might be writing to the
affected filesystems, and manually copying out the final state of the
*other* parts of those filesystems via rsync, bypassing rsnapshot. That
was on Saturday the 10th.
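
(For the record, the shape of that final rsync copy was roughly the
following - the mount points are illustrative rather than my exact
paths:

    # -a: preserve permissions/ownership/times; -H: preserve hardlinks;
    # -A/-X: preserve ACLs and extended attributes; -x: stay on one filesystem
    rsync -aHAXx --numeric-ids /home/ /mnt/external/final-home/
    rsync -aHAXx --numeric-ids /opt/  /mnt/external/final-opt/

The important part for my case is -H, since without it the hardlinked
trees would each have been copied out as full, separate files.)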

Then I grabbed copies of various metadata about the filesystems, the
LVM, and the mdraid config; modified /etc/fstab to not mount them;
deactivated the mdraid, and commented it out of /etc/mdadm/mdadm.conf;
updated the initramfs; shut down; pulled all eight Samsung 870 EVO
drives; installed six brand-new Intel data-center-class (or so I gather)
SSDs; booted up; partitioned the new drives based on the data I had
about what config the Debian installer put in place when creating the
mdraid config on the old ones; created a new mdraid RAID-6 array on
them, based on the copied metadata; created a new LVM stack on top of
that, based on *that* copied metadata; created new filesystems on top of
that, based on *that* copied metadata; rsync'ed the data in from the
manually-created external backup; adjusted /etc/fstab and
/etc/mdadm/mdadm.conf to reflect the new UUIDs and names of the new
storage configuration; updated the initramfs; and rebooted. Given delay
times for the drives to arrive and for various data-validation and
plan-double-checking steps to complete, the end of that process happened
this afternoon.
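
In case the skeleton of that sequence is useful to anyone else, the
commands involved look roughly like the below. Device names, array/VG/LV
names, sizes, and the choice of ext4 are placeholders standing in for
values I took from my saved metadata, so treat this as a sketch to check
against your own configuration rather than something to run verbatim:

    # 1. capture metadata from the old stack while it still exists
    mdadm --detail /dev/md1              > md1-detail.txt
    sfdisk -d /dev/sda                   > sda-partitions.txt   # repeat per member drive
    vgcfgbackup -f vg_home-backup.txt vg_home
    lsblk -o NAME,SIZE,TYPE,FSTYPE,UUID  > lsblk.txt

    # 2. take the old array out of the boot path
    #    (after commenting its entries out of /etc/fstab and /etc/mdadm/mdadm.conf)
    umount /home /opt
    vgchange -an vg_home
    mdadm --stop /dev/md1
    update-initramfs -u

    # 3. with the new drives installed: partition them, then rebuild the stack
    sfdisk /dev/sda < sda-partitions.txt       # per drive, adjusting for the new sizes
    mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[abcdef]1
    pvcreate /dev/md1
    vgcreate vg_home /dev/md1
    lvcreate -L 1T   -n home vg_home
    lvcreate -L 500G -n opt  vg_home
    mkfs.ext4 /dev/vg_home/home
    mkfs.ext4 /dev/vg_home/opt

    # 4. restore the data and re-register the new identifiers
    mount /dev/vg_home/home /home
    mount /dev/vg_home/opt  /opt
    rsync -aHAXx --numeric-ids /mnt/external/final-home/ /home/
    rsync -aHAXx --numeric-ids /mnt/external/final-opt/  /opt/
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # then edit /etc/fstab to match
    update-initramfs -u

The fstab and mdadm.conf edits have to come after the new array and
filesystems exist, because the new UUIDs aren't known until then.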

And it appears to Just Work. I haven't examined all the data to validate
that it's in good condition, obviously (since there's nearly 3TB of it),
but the parts I use on a day-to-day basis are all looking exactly the
way they should be. It appears that the cross-drive redundancy of the
RAID-6 array was enough to avoid data loss from the
scattered read failures of the underlying drives before I could get the
data out.
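
(When I do get around to a fuller check, one low-effort option would be
a checksum-based dry run of rsync against the external copy, roughly
like this - paths again being placeholders:

    # -c forces full-content checksums; -n makes this a dry run, so rsync
    # only reports files whose contents differ from the external backup
    rsync -aHcn --itemize-changes /mnt/external/final-home/ /home/ | less

Anything that shows up in that output would then be worth a closer
look.)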

(This does leave me without having restored the read-only backup data
from the old system state. I care less about that; I'll want it
eventually, but it isn't important enough to warrant postponing getting
the system back in working order.)


I do still want/need to figure out what to do about an *actual* backup
system, to external storage, since the rsnapshot thing apparently isn't
going to be viable for my circumstance and use case. There is, however,
now *time* to work on doing that, without living under the shadow of a
known immediate/imminent data-loss hardware failure.

I also do mean to read the rest of the replies in this thread, now that
doing so is unlikely to aggravate my stress heartburn...

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw
