Hi David Turner,

This is our Ceph config under the [mon] section. We have EC 4+1, the failure
domain set to host, and mon_osd_min_down_reporters set to 4 (i.e. OSDs from 4
different hosts).

[mon]
mon_compact_on_start = True
mon_osd_down_out_interval = 86400
mon_osd_down_out_subtree_limit = host
mon_osd_min_down_reporters = 4
mon_osd_reporter_subtree_level = host

We have 68 disks. Can we increase mon_osd_min_down_reporters to 68?
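
If so, would runtime injection along these lines be the right way to apply it
(just a sketch; the value 68 here is only the one from the question above, not
a recommendation)?

# apply on the mons at runtime (sketch only)
ceph tell mon.* injectargs '--mon_osd_min_down_reporters=68'
# and persist it under [mon] in ceph.conf:
# mon_osd_min_down_reporters = 68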

Thanks,
Muthu

On Tue, May 22, 2018 at 5:46 PM, David Turner <[email protected]> wrote:

> What happens when a storage node loses its cluster network but not its
> public network is that all other OSDs in the cluster see that it's down and
> report that to the mons, but the node can still talk to the mons, telling
> the mons that it is up and that, in fact, everything else is down.
>
> The setting mon_osd_min_down_reporters (I think that's the name of it off
> the top of my head) is designed to help with this scenario. Its default is 1,
> which means any OSD on either side of the network problem will be trusted by
> the mons to mark OSDs down. What you want to do with this setting is to set
> it to at least 1 more than the number of OSDs in your failure domain. If the
> failure domain is host and each node has 32 OSDs, then setting it to 33 will
> prevent a single problematic node from being able to cause havoc.
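>
> As a rough sketch of that (assuming ceph.conf on the mons and the
> 32-OSD-per-host example above; adjust to your actual per-host count):
>
> [mon]
> mon_osd_min_down_reporters = 33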
>
> The OSDs will still try to mark themselves as up, and this will still cause
> problems for reads until the OSD process stops or the network comes back up.
> There might be a setting for how long an OSD will keep telling the mons it's
> up, but this isn't really a situation I've come across after initial testing
> and installation of nodes.
>
> On Tue, May 22, 2018, 1:47 AM nokia ceph <[email protected]> wrote:
>
>> Hi Ceph users,
>>
>> We have a cluster with 5 nodes (67 disks), an EC 4+1 configuration, and
>> min_size set to 4.
>> Ceph version: 12.2.5
>> While executing one of our resilience use cases, taking the private
>> interface down on one of the nodes, up to Kraken we saw only a short rados
>> outage (about 60s).
>>
>> Now with Luminous, we see a rados read/write outage of more than 200s. In
>> the logs we can see that the peer OSDs report that the OSDs on that node
>> are down, but the affected OSDs insist they were wrongly marked down and do
>> not move to the down state for a long time.
>>
>> 2018-05-22 05:37:17.871049 7f6ac71e6700  0 log_channel(cluster) log [WRN]
>> : Monitor daemon marked osd.1 down, but it is still running
>> 2018-05-22 05:37:17.871072 7f6ac71e6700  0 log_channel(cluster) log [DBG]
>> : map e35690 wrongly marked me down at e35689
>> 2018-05-22 05:37:17.878347 7f6ac71e6700  0 osd.1 35690 crush map has
>> features 1009107927421960192, adjusting msgr requires for osds
>> 2018-05-22 05:37:18.296643 7f6ac71e6700  0 osd.1 35691 crush map has
>> features 1009107927421960192, adjusting msgr requires for osds
>>
>>
>> Only when all 67 OSDs have moved to the down state does the read/write
>> traffic resume.
>>
>> Could you please help us resolve this issue? If it is a bug, we will create
>> a corresponding ticket.
>>
>> Thanks,
>> Muthu
>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
