Bob Proulx wrote:
> Ruben is working on this problem full time since the previous report.
>
> This is an urgent and important problem.  The host system and all of
> the hosted VMs including *.savannah will be rebooted (probably more
> than once) today as part of the debugging.  These may be longer-ish
> downtimes as the underlying host will need more time to boot and then
> all of the VMs hosted will need to boot.

The problem was a faulty SSD in a RAID10 set of four.  It was
returning corrupted data and reporting it as good data.  The faulty
drive has been removed from the array.  All of the systems have been
booted back up and are online again.

Writes were going to both the good and the bad drives.  Reads were
interleaved between the good and the bad drives.  That is why data
being read would sometimes come back corrupted.  With the corrupting
drive removed, it is believed that all of the data on the remaining
good drive is okay.

Bob
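
For anyone curious why the corruption was only intermittent, here is a
toy sketch of the failure mode described above.  It models a single
mirrored pair only, not the full four-drive RAID10, and the class
names (Drive, MirrorPair) and the round-robin read policy are
assumptions made up for illustration.  It is not how the actual host
or its RAID layer schedules reads.

```python
class Drive:
    """Toy block device. A faulty one silently corrupts what it returns."""

    def __init__(self, faulty=False):
        self.blocks = {}
        self.faulty = faulty

    def write(self, lba, data):
        self.blocks[lba] = data

    def read(self, lba):
        data = self.blocks[lba]
        if self.faulty:
            # Flip the bits of the first byte, but report no error.
            return bytes([data[0] ^ 0xFF]) + data[1:]
        return data


class MirrorPair:
    """Hypothetical mirror: writes go to every member, reads alternate."""

    def __init__(self, a, b):
        self.members = [a, b]
        self.next_reader = 0

    def write(self, lba, data):
        for drive in self.members:
            drive.write(lba, data)

    def read(self, lba):
        # Assumed round-robin read balancing between the members.
        drive = self.members[self.next_reader % len(self.members)]
        self.next_reader += 1
        return drive.read(lba)

    def remove(self, drive):
        self.members.remove(drive)


def count_bad_reads(pair, expected):
    """Read every block once and count those that differ from what was written."""
    return sum(1 for lba, data in expected.items() if pair.read(lba) != data)


def demo():
    good, bad = Drive(), Drive(faulty=True)
    pair = MirrorPair(good, bad)
    expected = {lba: f"block-{lba}".encode() for lba in range(8)}
    for lba, data in expected.items():
        pair.write(lba, data)

    before = count_bad_reads(pair, expected)  # reads hit both drives
    pair.remove(bad)
    after = count_bad_reads(pair, expected)   # reads hit the good drive only
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this model half of the reads come back wrong while the faulty drive
is still a member, and none do once it is removed, because the good
drive received every write as well.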
