On Thu, Apr 23, 2015 at 5:20 AM, Kenneth Waegeman wrote:
>
> So it is all fixed now, but is it explainable that at first about 90% of
> the OSDs went into shutdown over and over, and only after some time reached
> a stable state, because of one host's network failure?
>
> Thanks again!


Yes, unless you've adjusted:
[global]
  mon osd min down reporters = 9
  mon osd min down reports = 12

OSDs talk to the MONs on the public network.  The cluster network is only
used for OSD to OSD communication.

If one OSD node can't talk on the cluster network, the other nodes will tell
the MONs that its OSDs are down.  And that node will also tell the MONs that
all the other OSDs are down.  Then the OSDs marked down will tell the MONs
that they're not down, and the cycle will repeat.

I'm somewhat surprised that your cluster eventually stabilized.


I have 8 OSDs per node.  I set my min down reporters high enough that no
single node can mark another node's OSDs down.
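
The arithmetic behind that setting can be sketched in a few lines of Python.
This is only an illustration, assuming each OSD counts as one distinct
reporter (check how your Ceph release counts reporters before relying on it).
The function names are hypothetical and not part of any Ceph tool.

```python
def min_down_reporters(max_osds_per_node: int) -> int:
    """Smallest reporter count that one node's OSDs cannot reach alone.

    Assumption: every OSD counts as one distinct reporter.
    """
    if max_osds_per_node < 1:
        raise ValueError("a node must host at least one OSD")
    return max_osds_per_node + 1


def single_node_can_mark_down(osds_on_node: int, reporters_required: int) -> bool:
    """True if the OSDs of one node are enough to get another OSD marked down."""
    return osds_on_node >= reporters_required


def ceph_conf_snippet(max_osds_per_node: int) -> str:
    """Render the [global] line for the computed reporter threshold."""
    reporters = min_down_reporters(max_osds_per_node)
    return "[global]\n  mon osd min down reporters = {}\n".format(reporters)


if __name__ == "__main__":
    # 8 OSDs per node, as in the cluster described above.
    print(ceph_conf_snippet(8))
```

With 8 OSDs per node this gives the 9 used in the config above: a node that
loses its cluster network can then supply at most 8 reporters, one short of
the threshold.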
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com