On 9 Jan 2024 13:25 -0500, from wande...@fastmail.fm (The Wanderer):
>>> Within the past few weeks, I got root-mail notifications from
>>> smartd that the ATA error count on two of the drives had increased
>>> - one from 0 to a fairly low value (I think between 10 and 20), the
>>> other from 0 to 1. I figured this was nothing to worry about -
>>> because of the relatively low values, because the other drives had
>>> not shown any such thing, and because of the expected stability and
>>> lifetime of good-quality SSDs.
>>> 
>>> On Sunday (two days ago), I got root-mail notifications from
>>> smartd about *all* of the drives in the array. This time, the total
>>> error counts had gone up to values in the multiple hundreds per
>>> drive. Since then (yesterday), I've also gotten further
>>> notification mails about at least one of the drives increasing
>>> further. So far today I have not gotten any such notifications.
> 
> Do you read the provided excerpt from the SMART data as indicating that
> there are hundreds of bad blocks, or that they are rising rapidly?

No; that was your claim, in the paragraph about Sunday's events.


> The Runtime_Bad_Block count for that drive is nonzero, but it is only 31.
> 
> What's high and seems as if it may be rising is the
> Uncorrectable_Error_Cnt value (attribute 187) - which I understand to
> represent *incidents* in which the drive attempted to read a sector or
> block and was unable to do so.

The drive may be performing internal housekeeping and, in doing so,
trying to read those blocks; or something about your RAID array setup
may be reading them.
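
To judge whether attribute 187 really is rising, it helps to compare raw
values between two readings rather than rely on memory of the alert
mails. A minimal sketch, assuming the attribute table layout printed by
`smartctl -A` (ten columns, raw value last); the function name and the
sample text are made up for illustration, and only the value 31 for
Runtime_Bad_Block comes from this thread:

```python
def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value) from `smartctl -A` table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # the header row ("ID# ...") and other output are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw column may carry a trailing note such as "(Min/Max ...)".
        raw = fields[9].split()[0]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


# Illustrative input only; the value for attribute 187 is invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       31
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       500
"""

attrs = parse_smart_attributes(SAMPLE)
print(attrs[187])
```

Saving the parsed result from each run and comparing the raw value for
ID 187 against the previous one would show whether the count is still
climbing.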

Exactly what are you using for RAID-6? mdraid? An off-board hardware
RAID HBA? Motherboard RAID? Or something else? What you say suggests
mdraid or something similar.
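
If it is mdraid, /proc/mdstat lists each array together with its member
status. A rough sketch of reading that file's text, assuming the usual
layout (an "mdN : state level members" line followed by a line holding
"[total/working] [UU_...]"); the function name and the sample are
hypothetical and do not describe your system:

```python
import re

# "[6/5] [UUUUU_]": configured devices / working devices, then one flag each.
STATUS = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array: name, state, level, degraded flag."""
    arrays = []
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\S+) : (\S+)(.*)$", line)
        if head:
            words = head.group(3).split()
            # The first plain word after the state is the RAID level, if any;
            # member entries contain "[" and flags start with "(".
            level = next(
                (w for w in words if "[" not in w and not w.startswith("(")),
                None,
            )
            current = {
                "name": head.group(1),
                "state": head.group(2),
                "level": level,
                "degraded": None,  # unknown until the status line is seen
            }
            arrays.append(current)
            continue
        status = STATUS.search(line)
        if current is not None and status:
            total, working = int(status.group(1)), int(status.group(2))
            current["degraded"] = working < total
    return arrays


# Illustrative input only.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      7813523456 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

unused devices: <none>
"""

for array in parse_mdstat(SAMPLE):
    print(array["name"], array["level"], "degraded:", array["degraded"])
```

An empty result would suggest that mdraid is not what is providing the
RAID-6, at least as far as this file shows.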


> I've ordered a 22TB external drive for the purpose of creating such a
> backup. Fingers crossed that things last long enough for it to get here
> and get the backup created.

I suggest selecting, installing and configuring (as much as possible)
whatever software you will use to actually perform the backup while
you wait for the drive to arrive. It might save you a little time
later. Opinions differ, but I like rsnapshot myself; it's really just a
front-end for rsync, so the copy is simply files, making partial or
full restoration easy without any special tools.
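
To illustrate the "simply files" point: a snapshot of this kind can be
made with plain rsync, hard-linking unchanged files against the previous
snapshot. The sketch below only builds such a command line; it is not
rsnapshot's own code, and the function name, paths and snapshot names
are hypothetical. It assumes snapshot_root is an absolute path.

```python
def rsync_snapshot_cmd(source, snapshot_root, new_name, previous_name=None):
    """Build an rsync argv that copies source into a new snapshot directory."""
    root = snapshot_root.rstrip("/")
    cmd = ["rsync", "-a", "--delete"]
    if previous_name is not None:
        # Unchanged files become hard links into the previous snapshot,
        # so each snapshot looks complete but only changes use new space.
        cmd.append(f"--link-dest={root}/{previous_name}")
    # The trailing slash on the source copies its contents, not the directory.
    cmd += [source.rstrip("/") + "/", f"{root}/{new_name}/"]
    return cmd


# Hypothetical paths, for illustration only.
print(rsync_snapshot_cmd("/home", "/mnt/backup", "daily.0", "daily.1"))
```

Restoring from such a snapshot is an ordinary file copy out of the
snapshot directory.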


> dmesg does have what appears to be an error entry for each of the events
> reported in the alert mails, correlated with the devices in question. I
> can provide a sample of one of those, if desired.

As long as the drive is being honest about failures and is reporting
them promptly, the RAID array can do its work. What you absolutely
don't want to see is I/O errors relating to the RAID array device (for
example, with mdraid, /dev/md*), because that would presumably mean
that the redundancy was insufficient to correct for the failure. If
that happens, you are falling off the proverbial cliff.
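
One way to watch for that is to filter the kernel log for I/O errors
that name an md device rather than a member disk. A small sketch,
assuming messages of the general form "... I/O error ... dev mdN ...";
exact kernel wording varies between versions, and the function name and
sample lines are invented for illustration:

```python
import re

# Matches an I/O error message that names an md array device (md0, md127, ...).
MD_IO_ERROR = re.compile(r"I/O error.*?\bdev (md\d+)\b")


def md_io_errors(log_text):
    """Return (device, line) pairs for I/O errors reported on md devices."""
    hits = []
    for line in log_text.splitlines():
        match = MD_IO_ERROR.search(line)
        if match:
            hits.append((match.group(1), line))
    return hits


# Illustrative log lines only: one error on a member disk, one on the array.
SAMPLE = """\
[12345.600000] blk_update_request: I/O error, dev sda, sector 2048 op 0x0:(READ)
[12345.678901] Buffer I/O error on dev md0, logical block 1024, async page read
"""

for device, line in md_io_errors(SAMPLE):
    print(device, "->", line)
```

Errors on member disks alone are skipped here on purpose: those are the
ones the array is expected to absorb.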

-- 
Michael Kjörling                     🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”
