I have four 4 TB hard drives. Each has one partition on it. The four partitions make up a RAID 5 array using mdadm. On top of that sits LUKS encryption, then LVM with ext4 logical volumes.
On one of the logical volumes I have a number of backup files, tarred, bzipped, and checksummed with both sha256 and sha512. I have a script which finds the checksum files and executes the appropriate program to test the archives. It puts each program into the background, parallelizing any number of checksum tests.

Starting about a week ago, the script finds an error in one or more files out of several. Results are inconsistent: one pass may find an error in a given file, and the next pass may find no errors in it.

Running checksums manually, one at a time, does not turn up an error. Running "tar tvf" finds no error in a suspect file. Running "bunzip2 -t" also turns up no error. Only running the script turns up any errors.

I create two checksum files when I create the backups, one for sha256 and one for sha512. After this problem surfaced, I made two new checksum files for a suspect file. The old and new checksum files agree with each other (e.g. both sha512sum files show the same checksum). The script now tests using both the old and new checksum files. Sometimes only one pair of checksum files fails the suspect file.

In addition to all of that, I also get the occasional "bad message" error. I have no idea what that means, but an fsck seems to deal with it.

To be thorough, I have run extended SMART tests on the hard drives, kicked mdadm into testing the RAID array, and fscked the logical volumes on the RAID array. Only fsck turned up issues, and those have not stopped.

I also back some of this up to offsite USB drives. I ran the script on one of those, using a different computer. No errors reported.

I have a hypothesis as to what is going on, but would like to hear from you before I discuss it.

-- 
Does anybody read signatures any more?
https://charlescurley.com
https://charlescurley.com/blog/
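
For anyone who wants to reproduce the test pattern, here is a minimal Python sketch of the kind of parallel checksum run described above. It is not the actual script. The `.sha256`/`.sha512` file suffixes, the function names, and the use of a thread pool in place of background processes are all assumptions made for illustration. It expects checksum files in the usual sha256sum/sha512sum layout (hex digest, whitespace, file name).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed naming convention: checksum files end in .sha256 or .sha512.
ALGORITHMS = {".sha256": "sha256", ".sha512": "sha512"}


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_checksum_file(checksum_path):
    """Yield (expected_hex, target_path) pairs from one checksum file."""
    checksum_path = Path(checksum_path)
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        if name.startswith("*"):  # binary-mode marker used by sha*sum
            name = name[1:]
        yield expected.lower(), checksum_path.parent / name


def check_one(checksum_path):
    """Return a list of (target, matches) for one checksum file."""
    algorithm = ALGORITHMS[Path(checksum_path).suffix]
    return [
        (str(target), file_digest(target, algorithm) == expected)
        for expected, target in parse_checksum_file(checksum_path)
    ]


def check_tree(root, workers=4):
    """Check every checksum file under root, `workers` files at a time."""
    files = sorted(p for p in Path(root).rglob("*") if p.suffix in ALGORITHMS)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(check_one, files))
    return {str(c): r for c, r in zip(files, per_file)}
```

Calling `check_tree(root, workers=1)` and then `check_tree(root, workers=8)` several times over the same directory gives a direct comparison between the one-at-a-time runs, which come back clean, and the parallel runs, which do not.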

