Hi,

I couldn't agree more, but just to re-emphasize what others have already said:

  the point of replica 3 is not to have extra safety for
  (human|software|server) failures, but to have enough data around to
  allow rebalancing the cluster when disks fail.

after a certain number of disks in a cluster, you're going to get disk
failures all the time. if you don't pay extra attention (and waste lots
and lots of time/money) to carefully arrange/choose disks from different
vendors' production lines/dates, simultaneous disk failures can happen
within minutes.
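a quick back-of-envelope shows why (in python; the failure rate and
cluster size here are made-up illustrative numbers, not measurements):

```python
# expected disk failures in a large cluster -- illustrative numbers only
afr = 0.02        # assumed ~2% annualized failure rate per disk
disks = 1000      # assumed cluster size

failures_per_year = afr * disks
print(f"expected failures per year: {failures_per_year:.0f}")  # -> 20

# chance of at least one disk failure in any given week
p_week = 1 - (1 - afr) ** (disks / 52)
print(f"P(some disk fails this week): {p_week:.2f}")  # roughly 0.32
```

with numbers like that, disks failing is the normal operating state of
the cluster, not an exceptional event.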


example from our past:

on our (at that time small) cluster of 72 disks spread over 6 storage
nodes, half of them were Seagate Enterprise Capacity disks, the other
half Western Digital Red Pro. for each disk manufacturer, only half of
the disks came from the same production batch. so.. we had..

  * 18 disks wd, production batch A
  * 18 disks wd, production batch B
  * 18 disks seagate, production batch C
  * 18 disks seagate, production batch D

one day, 6 disks failed simultaneously, spread over two storage nodes.
had we had replica 2, we couldn't have recovered and would have lost
data. instead, because of replica 3, we didn't lose any data and ceph
automatically rebalanced everything before further disks failed.


so: if the data stored on the cluster is valuable (because it costs much
time and effort to 're-collect' it, or you can't accept the time it
takes to restore from backup, or worse, to re-create it from scratch),
you have to assume that whatever manufacturer/production batch of HDDs
you're using, they *can* all fail at the same time, because you could
have hit a faulty batch.

the only way out here is replica >=3.

(of course, the usual MTBF math and the reasons why RAID doesn't scale apply here as well)

Regards,
Daniel
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
