Interesting paper at FAST:
https://www.usenix.org/system/files/conference/fast15/fast15-paper-ma.pdf
Short version: reallocated sector counts correlate with impending disk
failures (this sounds like what Sandon has been telling us for ages).
Preemptively replacing disks with impending failures reduced EMC's rate
of triple failures by 80%, and looking at the joint failure probability
within each RAID set reduced the failure rate by 98%. We wouldn't see
quite the same results since our "RAID sets" are effectively entire pools,
but this seems like a strong case for adding SMART monitoring to the OSDs
or to calamari and doing some preemptive disk replacement.
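
To make the "joint failure probability" idea concrete, here is a rough
sketch, not the paper's actual method. It assumes each disk already has
an estimated failure probability (how you would derive that from
reallocated sector counts is not shown) and that disks fail
independently, which is optimistic. The function name and the inputs are
made up for illustration.

```python
def prob_at_least_k_failures(probs, k):
    """Probability that at least k disks in a set fail.

    probs: hypothetical per-disk failure probabilities (assumed inputs).
    Assumes failures are independent, which real disks may not be.
    """
    # dist[j] = probability that exactly j of the disks seen so far fail
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)  # this disk survives
            nxt[j + 1] += q * p      # this disk fails
        dist = nxt
    return sum(dist[k:])


if __name__ == "__main__":
    # Made-up numbers: two disks look suspect, the rest look healthy.
    raid_set = [0.30, 0.20, 0.01, 0.01, 0.01, 0.01]
    print(prob_at_least_k_failures(raid_set, 3))
```

For us the "set" would be whatever group of OSDs can jointly lose data,
so the input list would be much longer than a typical RAID group.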
sage