On 7/11/25 2:05 PM, Crystal Kolipe wrote:
> On Fri, Jul 11, 2025 at 09:55:23AM -0700, Kevin Williams wrote:
>> I certainly want to detect a failing drive for my source data and replace it
>> before corrupted data is backed up from it.
> No matter what techniques you use, this is more difficult than it might seem.
> Copying data from one place to another means relying on the firmware on the
> source device, the interconnection to the CPU (e.g. SATA cable, SCSI cable,
> whatever), the OS device driver for the source device, filesystem and buffer
> cache management code, the stability of the system RAM whilst the data is in
> the buffer cache, and most of the same things on the way to the target device.
> Copying around a few MB of data, you might not notice problems even on a
> flaky system. Once you get into the TB range or a fair fraction of a
> petabyte, unexpected errors are more likely to show up even on seemingly
> perfect hardware.
> Different techniques save you from different failure scenarios.
> A complicated filesystem that does it all for you behind the scenes might
> seem great at first, but if and when it does eventually come crashing down,
> your options to do any recovery are going to be restricted by the complexity
> of the thing in the first place.
> Sure, you should have backups. But there are times when you have 99% of your
> data backed up, but still want to recover that one file that was critically
> modified just after the last time it hit the tape.
> This is the counter-argument to those who insist that media longevity
> doesn't matter because finding a drive to read it back in 25 years' time
> will be impossible (which in many cases is dubious anyway).
15-20 years has been successful for me. I still have data from 1983.
It has required creative scrounging for old hardware.
> Migrating your data from one place to another repeatedly could easily
> introduce bit errors that go unnoticed. Don't assume that the drive's own
> CRC checking will catch it.
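Agreed. The only defence I know of is to verify every migration end to
end: hash the data before the copy, then re-read and hash it afterwards.
A minimal sketch in Python (the function names and the choice of SHA-512
are mine, purely illustrative):

import hashlib
import shutil

def sha512_of(path, bufsize=1 << 20):
    # Stream the file through SHA-512 so huge files need not fit in RAM.
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def verified_copy(src, dst):
    # Hash the source, copy it, then re-read the destination and compare.
    # Caveat, per the point above about the buffer cache: the re-read may
    # be served from cache, so this catches errors in the copy path but
    # does not prove the bits actually reached the platter.
    before = sha512_of(src)
    shutil.copyfile(src, dst)
    if sha512_of(dst) != before:
        raise IOError("copy of %s to %s is corrupt" % (src, dst))
    return before

# e.g. digest = verified_copy("/data/book.tex", "/backup/book.tex")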
[ snip ]
Very good advice.
System failures as well as mass storage errors must be considered.
Some problems I've encountered:
- Main memory without ECC or parity is a silent source of single-bit
  errors, and such errors are far more common than one might expect.
- As of three years ago, research reports said disks are more likely to
  have block read errors or fail completely than to have bit errors;
  sudden complete failure, with no warning and not even a single read
  error beforehand, was common.
- A bad power supply can fry your entire disk array in seconds, or cause
  flaky errors without crashing the system.
- Blocked, clogged, or failed fans can cause silent data corruption and
  catastrophic system failures.
- Backup optical and tape drives can silently write garbage.
NAS-grade disks, frequent backups with occasional offsite storage,
checksums taken frequently on all data, monitoring disk health and
temperature, and watching for flaky applications have served me well.
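Concretely, the checksum pass can be a small scrubber run from cron.
A sketch in Python (the manifest format and paths are illustrative, not
a specific tool I use): it re-hashes everything and flags any file whose
contents changed while its size and mtime did not, which is the
signature of silent corruption rather than a normal edit.

import hashlib
import json
import os

def sha512_of(path, bufsize=1 << 20):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(root, manifest_path):
    # Compare every file against the previous run, then rewrite the manifest.
    try:
        with open(manifest_path) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    new = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            new[path] = {"sha512": sha512_of(path),
                         "size": st.st_size, "mtime": st.st_mtime}
            prev = old.get(path)
            if (prev and prev["sha512"] != new[path]["sha512"]
                    and prev["size"] == st.st_size
                    and prev["mtime"] == st.st_mtime):
                print("POSSIBLE SILENT CORRUPTION:", path)
    with open(manifest_path, "w") as f:
        json.dump(new, f)

# e.g. scrub("/home/me/data", "/home/me/.scrub-manifest.json")

Keeping a copy of the manifest off the machine guards against the
manifest itself rotting along with the data.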
> There is a danger of over-thinking things here.
> Just about any system will occasionally read a bad bit from disk, whatever
> anybody tells you. What matters is that the application detects that and
> doesn't process bad data as good.
> For a home data storage setup, just use a regular disk and regular backups.
> If you need good uptime (e.g. a music server for a radio station), then use
> a RAID-1 mirror.
> If your data is valuable (a book you spent 5 years writing), then make
> multiple backups, store them properly, and keep copies of the sha512 hashes
> in multiple places so that you know, when you read the backup, whether it's
> good or not.
[snip]
More good advice.
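On the sha512 point: the checker can be tiny, as long as the hash list
itself is stored in several places. A sketch (the two-column
"digest  path" line format is my own choice here, similar in spirit to
sha512sum-style output):

import hashlib
import sys

def sha512_of(path):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check(listfile):
    # Each line of listfile: '<hex sha512>  <path>', separated by two spaces.
    ok = True
    with open(listfile) as f:
        for line in f:
            digest, path = line.rstrip("\n").split("  ", 1)
            good = sha512_of(path) == digest
            ok = ok and good
            print("%s: %s" % (path, "OK" if good else "FAILED"))
    return ok

# e.g. exit non-zero if any backup file fails verification:
if __name__ == "__main__":
    sys.exit(0 if check(sys.argv[1]) else 1)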
geoff steckel