Hi David Turner,

This is our ceph config under the [mon] section. We have EC 4+1, the failure domain set to host, and mon_osd_min_down_reporters set to 4 (i.e. OSDs from 4 different hosts):
[mon]
mon_compact_on_start = True
mon_osd_down_out_interval = 86400
mon_osd_down_out_subtree_limit = host
mon_osd_min_down_reporters = 4
mon_osd_reporter_subtree_level = host

We have 68 disks; can we increase mon_osd_min_down_reporters to 68?

Thanks,
Muthu

On Tue, May 22, 2018 at 5:46 PM, David Turner <[email protected]> wrote:
> What happens when a storage node loses its cluster network but not its
> public network is that all other OSDs in the cluster see that it's down and
> report that to the mons, but the node can still talk to the mons, telling
> the mons that it is up and that, in fact, everything else is down.
>
> The setting mon_osd_min_down_reporters (I think that's the name of it off
> the top of my head) is designed to help with this scenario. Its default is
> 1, which means any OSD on either side of the network problem will be
> trusted by the mons to mark OSDs down. What you want to do with this
> setting is to set it to at least 1 more than the number of OSDs in your
> failure domain. If the failure domain is host and each node has 32 OSDs,
> then setting it to 33 will prevent a whole problematic node from being able
> to cause havoc.
>
> The OSDs will still try to mark themselves as up, and this will still cause
> problems for reads until the OSD process stops or the network comes back
> up. There might be a setting for how long an OSD will keep trying to tell
> the mons it's up, but this isn't really a situation I've come across after
> initial testing and installation of nodes.
>
> On Tue, May 22, 2018, 1:47 AM nokia ceph <[email protected]> wrote:
>
>> Hi Ceph users,
>>
>> We have a cluster with 5 nodes (67 disks), an EC 4+1 configuration, and
>> min_size set to 4.
>> Ceph version: 12.2.5
>> While executing one of our resilience use cases, taking the private
>> interface down on one of the nodes, up to Kraken we saw only a short rados
>> outage (around 60s).
>>
>> Now with Luminous, we see a rados read/write outage of more than 200s.
>> In the logs we can see that peer OSDs report that the OSDs on that node
>> are down; however, those OSDs claim they were wrongly marked down and do
>> not move to the down state for a long time.
>>
>> 2018-05-22 05:37:17.871049 7f6ac71e6700 0 log_channel(cluster) log [WRN]
>> : Monitor daemon marked osd.1 down, but it is still running
>> 2018-05-22 05:37:17.871072 7f6ac71e6700 0 log_channel(cluster) log [DBG]
>> : map e35690 wrongly marked me down at e35689
>> 2018-05-22 05:37:17.878347 7f6ac71e6700 0 osd.1 35690 crush map has
>> features 1009107927421960192, adjusting msgr requires for osds
>> 2018-05-22 05:37:18.296643 7f6ac71e6700 0 osd.1 35691 crush map has
>> features 1009107927421960192, adjusting msgr requires for osds
>>
>>
>> Only when all 67 OSDs have moved to the down state does the read/write
>> traffic resume.
>>
>> Could you please help us resolve this issue? If it is a bug, we will
>> create the corresponding ticket.
>>
>> Thanks,
>> Muthu
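For reference, a minimal sketch (not from the original thread) of how the setting David describes could be applied on a Luminous (12.2.5) cluster. The value 33 is an assumption taken from his 32-OSDs-per-host example; in practice it should be one more than the number of OSDs in your failure domain:

    # Persistent: in ceph.conf on the mon hosts (takes effect after a mon restart)
    [mon]
    mon_osd_min_down_reporters = 33

    # Or injected at runtime, without a restart:
    ceph tell mon.* injectargs '--mon_osd_min_down_reporters=33'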
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
