On Fri, Oct 27, 2023 at 6:42 AM Pierre Fourès <pierre.fou...@gmail.com>
wrote:

> Hi Felix,
>
> Your SMART data looks good to me, except for the hard drive temperature.
> 53°C seems quite high to me. Still, this should not be the cause of your
> corrupted data.
>
> Two data-corruption problems on the same server that look independent
> of each other, and that occurred a long time apart, remind me of a
> server that caused me lots of trouble until I discovered it had memory
> defects. I suspected hard disk failure and/or hard drive data corruption,
> but couldn't pin it down with smartctl or with the badblocks utility. I
> eventually found the problem by running extensive tests with the stress
> utility, which showed that in some runs the memory was corrupting data
> (which ended up corrupting data on disk). I had to run the tests many
> times to spot the defect. Subtle defects are really hard to spot.
>
> IMO, I would advise you to do a full scan of this server to find where
> the problem is, so that you can file this trail of problems as
> definitively solved. In my situation, similar to yours, the problems
> occurred too far apart to justify committing resources to a thorough
> investigation. This period of uncertainty and intuitive distrust of the
> server caused us hidden costs such as stress and fatigue. Having
> experienced it, if that happened again, I would prefer to rule out this
> situation quickly instead of knowing it lies dormant.
>
> Here are some links which might be relevant to you:
>   - https://en.wikipedia.org/wiki/Badblocks
>   - https://wiki.archlinux.org/title/Badblocks
>   - https://man.archlinux.org/man/stress.1
>   - https://wiki.archlinux.org/title/Stress_testing
>   - https://www.memtest.org/
>
> Best Regards,
> Pierre.
>


I can speak to RAM corruption as well. In one instance, we were
experiencing the strangest problems and blamed just about everything until
I ran the above memtest utility and it showed tremendous numbers of memory
errors. When I opened up the hardware, I found dust on and around the
memory. I cleaned that very thoroughly, put the system back together, and
ran memtest overnight or over a weekend with zero errors. Evidently, dust
can be conductive enough to act like a bunch of resistors across pins that
shouldn't have resistors across them.

As trivial as that sounds, I recommend checking for things like dust, and
since heat was mentioned, I'd check for fans that don't spin very freely. I
also recommend running memtest over a weekend, and finally, I am in the
camp that believes ECC RAM is a good idea, so I'd suggest checking
whether you are using ECC RAM.
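
For the ECC check, one possible approach on Linux is to look at the
"Error Correction Type" field that dmidecode prints for the memory array.
The sketch below is only an illustration: it assumes dmidecode is
installed and run as root, the function names are made up for this
example, and the field reflects what the firmware reports, so treat the
answer as a hint and not as proof.

```python
import subprocess


def parse_error_correction_types(dmidecode_output):
    """Return every 'Error Correction Type' value found in dmidecode text."""
    types = []
    for line in dmidecode_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Error Correction Type":
            types.append(value.strip())
    return types


def has_ecc(types):
    """True if any reported type mentions ECC (e.g. 'Multi-bit ECC')."""
    return any("ECC" in t for t in types)


def read_dmidecode():
    """Run 'dmidecode --type memory' (normally needs root) and return its output."""
    result = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        found = parse_error_correction_types(read_dmidecode())
    except (OSError, subprocess.CalledProcessError) as exc:
        # dmidecode missing, not permitted, or exited with an error.
        print(f"Could not run dmidecode: {exc}")
    else:
        print("Error correction types:", found or "none reported")
        print("ECC reported:", has_ecc(found))
```

A value such as "None" means the board reports no error correction. If
the script cannot run dmidecode, it prints the reason instead of failing.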

Hope this helps,
Nathan
