Hey Patrick,

> I do not normally post about firmware bugs, but I have this nightmare 
> scenario running through my head of someone with a couple of mirrored HPE SSD 
> arrays and all the drives going POOF!  simultaneously. Even with an off-site 
> backup, that could be disastrous. So if you have HPE SSDs, check this 
> announcement.
>
> https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us

A couple of years back a lot of folks had this problem with many
vendors' optics. One particular vendor's microcontroller was commonly
used by many optics vendors, and it had a bug: after 2**31 hundredths
of a second of uptime (roughly 248.5 days) it started to write the
uptime to the memory location of the temperature. Many systems,
including Cisco and Juniper, didn't react well to optic temperatures
reaching the maximum possible values.
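
To put a number on that interval, here is a rough sketch of the
arithmetic. The signed 32-bit counter and the as_int32 helper are my
assumptions for illustration; I don't know how the actual
microcontroller stored its uptime.

```python
# 2**31 ticks at 100 ticks per second, expressed in days.
TICKS_PER_SECOND = 100
SECONDS_PER_DAY = 86_400

overflow_ticks = 2**31
overflow_days = overflow_ticks / TICKS_PER_SECOND / SECONDS_PER_DAY


def as_int32(ticks):
    """Interpret the low 32 bits of ticks as a signed 32-bit value.

    Hypothetical helper: it assumes the counter is a signed 32-bit
    integer, which the real firmware may or may not have used.
    """
    ticks &= 0xFFFFFFFF
    return ticks - 2**32 if ticks >= 2**31 else ticks


print(round(overflow_days, 2))   # 248.55
print(as_int32(2**31 - 1))       # 2147483647, last value that still fits
print(as_int32(2**31))           # -2147483648, one tick later
```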

So say you had done large network-wide upgrades 2**31 hundredths of a
second ago, with enough time between upgrades to ensure that
everything works before continuing on the redundant parts. Then you'd
suddenly lose all legs from all devices, like a house of cards, no
matter how much redundancy was built in.
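
A minimal sketch of why the staggering doesn't help, assuming the
counter starts at the upgrade reboot. The device names and dates are
made up for illustration.

```python
from datetime import datetime, timedelta

# Time until the counter reaches 2**31 hundredths of a second.
OVERFLOW = timedelta(seconds=2**31 / 100)

# Hypothetical schedule: A side first, redundant B side two days later.
reboots = {
    "core-a": datetime(2019, 1, 7, 2, 0),
    "core-b": datetime(2019, 1, 9, 2, 0),
}

# Every device hits the bug the same fixed interval after its reboot.
failures = {name: when + OVERFLOW for name, when in reboots.items()}

upgrade_gap = max(reboots.values()) - min(reboots.values())
failure_gap = max(failures.values()) - min(failures.values())

for name, when in sorted(failures.items()):
    print(name, when)
print(failure_gap == upgrade_gap)  # True
```

The failures land in exactly the same window the upgrades did, so the
gap that looked careful at upgrade time is all the warning you get
before the redundant side follows.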

It just goes to show that a focus on MTBF is usually not a great
investment. It's hard to predict what will bring you down, and we tend
to be biased towards thinking it's some physical problem, solved by
redundant HW design, when it probably is not. It's probably something
related to software or the operator, and hard to predict or prepare
for. A focus on MTTR will have a much more predictable ROI.

I can't really point a finger at HP here; these are common bugs and an
easy thing for a human to miss. Perhaps static analysis, or more
complexity in the compiler and compile-time guarantees, should have
covered this.
-- 
  ++ytti
