On 5/22/26 08:53, Charles Curley wrote:
I have four 4-terabyte hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.

On one LVM partition I have a number of backup files, tarred,
bzipped, and sha256 and sha512 summed. I have a script which will find
checksum files, and execute the appropriate program to test the
archives. It puts each program into the background, parallelizing any
number of checksum tests.

Starting about a week ago, the script finds an error in one or more
files out of several. Results are inconsistent: one pass may find an
error in a given file, the next pass not find any errors in it. Running
checksums manually, one at a time, does not turn up an error. Running
"tar tvf" finds no error in a suspect file. Running "bunzip2 -t" also
turns up no error. Only running the script turns up any errors.

I create two checksum files when I create the backups, for sha256 and
sha512. After this problem surfaced (about a week ago), I then made two
new checksum files of a suspect file. The two checksum file pairs
(e.g. both sha512sum files) show the same checksums. The script now
tests using both the old and new checksum files. Sometimes only one pair
of checksum files fails the suspect file.

In addition to all of that, I also get the occasional "bad message"
error. I have no idea what that means, but an fsck seems to deal with
it.

To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM partitions
on the RAID array. Only fsck turned up issues, and that has not stopped.

I also back some of this up to offsite USB drives. I ran the script on
one of those, using a different computer. No errors reported.

I have a hypothesis as to what is going on, but would like to hear from
you before I discuss it.


On 5/22/26 09:05, Andrew Latham wrote:
> I had an issue some months back. It turned out to be a bad RAM stick
> in my NAS. The issues would not show up on a restart but after some
> usage it would hit the RAM errors and :(


On 5/22/26 09:14, Charles Curley wrote:
> This is not impossible. I recently had some RAM go bad, failing
> memtest. I have replaced it with new RAM, which does not fail
> memtest. Maybe I should let it run for several passes.


When I suspect a memory problem, I run Memtest86+ for 24+ hours:

    https://memtest.org/

    Linux ISO (64 bits) -> mt86plus_8.10_x86_64.iso.zip

The current version sets "CPU Sequencing Mode" to "Parallel (PAR)" by default.


I use and suggest ECC memory.


I use and suggest ZFS with redundant disks for storage.
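
Since the reported failures come and go between passes, one way to check
whether the data on disk is stable is to hash the same file several times
and compare the results. The following is only a minimal Python sketch, not
the script described above; the names sha256_of, repeated_digests, and
is_stable are made up for illustration. If the digests disagree between
passes, the file contents are changing somewhere between the platters and
the CPU, which points at RAM, cabling, or a controller rather than at the
archive itself.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path, passes=5):
    """Hash the same file `passes` times and return every digest."""
    return [sha256_of(path) for _ in range(passes)]


def is_stable(digests):
    """True if every pass produced the same digest."""
    return len(set(digests)) == 1


# Hypothetical command-line use: python3 rehash.py /path/to/archive.tar.bz2
if __name__ == "__main__" and len(sys.argv) > 1:
    results = repeated_digests(sys.argv[1])
    for i, digest in enumerate(results, start=1):
        print(f"pass {i}: {digest}")
    print("stable" if is_stable(results) else "MISMATCH between passes")
```

Keep in mind that Linux may serve the later passes from the page cache
rather than from the disks, so a mismatch here says more about memory than
about the drives. Running it while the machine is under load, as it is when
the parallel script runs, is probably the more interesting case.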


David
