All very true and worth considering, but I feel compelled to mention the 
strategy of setting mon_osd_down_out_subtree_limit carefully to prevent 
automatic rebalancing.
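
(For reference, on reasonably recent releases this can be inspected and changed 
at runtime; something along these lines, where "rack" is only illustrative and 
should match your actual CRUSH hierarchy:

    # what the monitors currently treat as "too big to auto-out"
    ceph config get mon mon_osd_down_out_subtree_limit

    # e.g. don't automatically mark OSDs out when an entire rack goes dark
    ceph config set mon mon_osd_down_out_subtree_limit rack

or the equivalent [mon] entry in ceph.conf on older releases.)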

*If* the loss of a failure domain is temporary, i.e. something you can fix 
fairly quickly, it can be preferable not to kick off that avalanche of 
recovery, both to avoid contention with client workloads and because of the 
fillage factor that David describes.
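
(To spell out the fillage arithmetic as I read David's numbers below, assuming 
for illustration that you want the three surviving domains to sit at no more 
than roughly 75% after backfill, comfortably under the nearfull/backfillfull 
ratios:

    usable ≈ 3/4 of raw capacity remaining × ~0.75 target fill ≈ 0.56

i.e. roughly the 55% figure he quotes.)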

If, of course, the loss of the failure domain can’t be corrected quickly, one 
is still left in a quandary: shift the data onto the surviving failure domains, 
or accept the risk of reduced redundancy while the problem is worked?

That said, I’ve seen situations where the OSDs in a failure domain weren’t 
reported down closely enough together in time, so the subtree limit never 
kicked in.

In my current situation we’re already planning to exploit the half-rack 
strategy you describe for EC clusters; it improves the failure-domain situation 
without monopolizing as many DC racks.

— aad 

> The problem with having 3 failure domains with replica 3 is that if you
> lose a complete failure domain, then you have nowhere for the 3rd replica
> to go.  If you have 4 failure domains with replica 3 and you lose an entire
> failure domain, then you over fill the remaining 3 failure domains and can
> only really use 55% of your cluster capacity.  If you have 5 failure
> domains, then you start normalizing and losing a failure domain doesn't
> impact as severely.  The more failure domains you get to, the less it
> affects you when you lose one.

