On Fri, Jul 11, 2025 at 09:55:23AM -0700, Kevin Williams wrote:
> I certainly want to detect a failing drive for my source data and replace it
> before corrupted data is backed up from it.
No matter what techniques you use, this is more difficult than it might seem.

Copying data from one place to another means relying on the firmware on the
source device, the interconnection to the CPU, (e.g. SATA cable, SCSI cable,
whatever), the OS device driver for the source device, filesystem and buffer
cache management code, the stability of the system RAM whilst the data is in
the buffer cache, and most of the same things on the way to the target device.

Copying around a few MB of data you might not notice problems even on a flaky
system. Once you get into the TB range or a fair fraction of a petabyte,
unexpected errors are more likely to show up even on seemingly perfect
hardware.

Different techniques save you from different failure scenarios. A complicated
filesystem that does it all for you behind the scenes might seem great at
first, but if and when it does eventually come crashing down, your options to
do any recovery are going to be restricted by the complexity of the thing in
the first place.

Sure, you should have backups. But there are times when you have 99% of your
data backed up, but still want to recover that one file that was critically
modified just after the last time it hit the tape.

This is the counter-argument to those who insist that media longevity doesn't
matter, because finding a drive to read it back in 25 years' time will be
impossible, (which in many cases is dubious anyway). Migrating your data from
one place to another repeatedly could easily introduce bit errors that go
unnoticed. Don't assume that the drive's own CRC checking will catch it.

> Does anyone have examples of such shell scripts, such as with cksum(1) or
> md5(1)?

I posted one to -misc a couple of years ago. Here is a slightly updated
version.

if [ "$1" == "i" ] ; then touch checksums ; fi

for i in `find . -name checksums` ; do (
    if [ "$1" == "a" ] ; then echo -n "Not v" ; else echo -n "V" ; fi
    echo "erifying checksums in directory ${i%/checksums}"
    cd ${i%/checksums}
    if [ "$1" != "a" ] ; then sha512 -cq checksums ; fi
    let flag=0
    for j in !(checksums|checksums.bak) ; do
        if [ ! -d "$j" ] ; then
            grep -F "($j)" checksums > /dev/null || {
                if [ -z "$1" ] ; then
                    echo "$j is not in the checksums file!"
                    let flag=1
                else
                    echo "Adding $j to checksums file"
                    sha512 "$j" >> checksums
                fi
            }
        fi
    done
    if [ $flag -eq 1 ] ; then
        echo "Run $0 with any command line arguments to add missing entries to the checksums file."
    else
        echo "All files have entries in the checksum file."
    fi
) ; done

if [ "$1" == "i" -a ! -s checksums ] ; then rm -f checksums ; fi

This is an _example_ to get you started writing your own script. The whole
idea is to make something that suits your needs. Don't just copy and paste
this one without tweaking it for yourself.

> Do you trust them with giant files of several gigabytes or even
> terabyte-sized VM or database files?

Yes. SHA-512 hashes have kept many TB of data intact for me, and detected
various random bit errors from time to time.

> If cryptographic hashes would be better, would features of LibreSSL, OpenSSH,
> or another OpenBSD base system tool be suitable for a fileserver/NAS?

SHA-256 is easily strong enough to detect random data corruption.

> Looking at the manpages, I don't think softraid(4) or bioctl(8) can contribute
> to repairing or replacing an individual corrupted file with a known good
> copy, or that ffs has a copies-equals-two option, except for the superblock
> described in fs(5) (search for the word 'copies').

Again, scripts.
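For example, something along these lines. This is only a sketch, not a
recommendation: it assumes a second copy of the same tree is mounted under
/backup and carries the same per-directory checksums files, and the paths are
made up for illustration. A corrupted file is only overwritten if the copy on
the backup disk verifies cleanly first.

cd /data/projects
for j in !(checksums|checksums.bak) ; do
    if [ ! -d "$j" ] ; then
        # check this one file against the per-directory checksums file
        sha512 -C checksums "$j" > /dev/null 2>&1 || {
            echo "$j failed verification, trying the copy on /backup"
            # verify the backup copy before pulling it across
            ( cd /backup/projects && sha512 -C checksums "$j" > /dev/null 2>&1 ) &&
                cp -p "/backup/projects/$j" "$j"
        }
    fi
done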
On my main workstation, I back up $HOME to another hard disk in the same
machine several times a day. Whenever I finish a large edit, a new version of
a patch, or just go to get coffee, I invoke the backup script. Fifteen seconds
later it's done, and it has automatically deleted the oldest backup. A
separate script restores the most recent one to a freshly mounted ramdisk.
Then I just cp over whichever file I accidentally screwed up.

If I invoke the backup script with an argument, it backs up to an external SSD
identified by its DUID. I do that once a day before I go home. (Of course, we
have a proper backup strategy in place as well; this is just my personal way
to get back up and running faster.) A rough sketch of that kind of script is
at the end of this mail.

> If the periodic checksum script discovers an error, what are options to
> correct it on the same system without restoring from backup?

It depends on the rest of the setup.

> But I want to consider OpenBSD for my NAS and see how others such as Brian
> have succeeded at it.

There is a danger of over-thinking things here. Just about any system will
occasionally read a bad bit from disk, whatever anybody tells you. What
matters is that the application detects that and doesn't process bad data as
good.

For a home data storage setup, just use a regular disk and regular backups.
If you need good uptime, (e.g. a music server for a radio station), then use
a RAID-1 mirror. If your data is valuable, (a book you spent 5 years writing),
then make multiple backups, store them properly, and keep copies of the
sha512 hashes in multiple places so that you know when you read the backup if
it's good or not.
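For what it's worth, the quick $HOME backup mentioned above boils down to
something like this. Again, only a sketch with made-up mount points (/altdisk
and /extssd) and a made-up retention of ten archives; adjust it for your own
disks before using it.

#!/bin/ksh
# quick snapshot of $HOME to a second internal disk, or to the external
# SSD (mounted via its DUID in fstab) when any argument is given
dest=/altdisk/homebackups
[ $# -gt 0 ] && dest=/extssd/homebackups

ts=$(date +%Y%m%d-%H%M%S)
tar -czf "$dest/home-$ts.tgz" -C "$HOME" . || exit 1

# keep only the ten newest archives (the file names contain no spaces)
ls -1t "$dest"/home-*.tgz | sed '1,10d' | xargs rm -f

Run with no argument it snapshots to the internal disk; with any argument it
goes to the external SSD instead.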