Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?

David Christensen Wed, 10 Jan 2024 18:17:18 -0800

On 1/10/24 09:07, Curt wrote:

On 2024-01-10, David Christensen <dpchr...@holgerdanske.com> wrote:

Given the OP's situation -- 8 consumer SSD's, same make and model,
possibly from a defective manufacturing batch, all purchased at the same
time, all deployed in the same RAID-6, all run 2.5 years 24x7, and all
suddenly showing lots of SMART warnings -- I would not have confidence
in that RAID.


It's curious, but I just heard something on French TV from a journalist
that's relevant to this. She said she'd covered the aeronautics field in
the past and mentioned the *principe de dissemblance* (dissimilarity
principle). Critical redundant parts on aircraft, she claimed, would be
sourced from different manufacturers in order to obviate the possibility
of redundant failures you've raised here.



https://en.wikipedia.org/wiki/Failure_analysis

Using components from different vendors is a known mitigation technique
and can help in the right situations.

Relevant to this thread, some people use disk drives from different
manufacturers in their RAID's. Doing so with RAID-10 (stripe of
mirrors) is straightforward -- within each mirror, use brand A for the
first disk, brand B for the second, brand C for the third (or hot
spares), etc. It then makes sense to do the same with HBA's -- use HBA
brand X for the first disks in each mirror, HBA brand Y for the second
disks, HBA brand Z for the third/spare disks, etc. For x86
workstations and servers, ECC memory, dual network interfaces, and dual
power supplies come to mind. I am unclear about dual processors and/or
dual memory banks. Moving beyond one computer, the process continues
with KVM/serial console fabric, networks, electric power, cooling,
etc. It's just a question of what failure modes you want to protect
against and how much time and money you want to spend.
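
To make the RAID-10 layout above concrete, here is a minimal Python
sketch. It only plans the assignment on paper: position N in every
mirror gets disk brand N and HBA brand N, and a small check confirms
that no mirror has two members sharing a disk brand or an HBA brand.
The function names (plan_mirrors, shared_components) are hypothetical,
the brand letters are the placeholders used above, and nothing here
talks to real hardware or to any RAID tool.

```python
def plan_mirrors(n_mirrors, disk_brands, hba_brands):
    """Plan a stripe of mirrors where position p in every mirror uses
    disk_brands[p] and hba_brands[p] (illustrative only)."""
    if len(disk_brands) != len(hba_brands):
        raise ValueError("need one HBA brand per mirror position")
    layout = []
    for m in range(n_mirrors):
        mirror = [
            {"mirror": m, "position": p, "disk": d, "hba": h}
            for p, (d, h) in enumerate(zip(disk_brands, hba_brands))
        ]
        layout.append(mirror)
    return layout


def shared_components(mirror):
    """Return (kind, brand) pairs that occur more than once in one mirror,
    i.e. the places where a single bad batch could take out two members."""
    dupes = set()
    for key in ("disk", "hba"):
        seen = set()
        for slot in mirror:
            if slot[key] in seen:
                dupes.add((key, slot[key]))
            seen.add(slot[key])
    return dupes


# Four two-way mirrors: brand A on HBA X, brand B on HBA Y.
layout = plan_mirrors(4, ["A", "B"], ["X", "Y"])
for mirror in layout:
    assert not shared_components(mirror)
    print([(s["disk"], s["hba"]) for s in mirror])
```

The same check run against the OP's kind of setup (every member the
same make and model) would flag every mirror, which is the point of
mixing vendors.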



David
