Hi Denis.

>  The mon then propagates that failure, without taking any other reports into 
> consideration:

Exactly. I cannot imagine that this change of behavior is intended. The config 
options for OSD down reporting ought to be honored in any failure situation. 
Since you have already investigated the relevant code lines, please update or 
create the tracker issue with your findings. I hope a dev looks at this.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Denis Krienbühl <de...@href.ch>
Sent: Friday, November 24, 2023 12:04 PM
To: Burkhard Linke
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered


> On 24 Nov 2023, at 11:49, Burkhard Linke 
> <burkhard.li...@computational.bio.uni-giessen.de> wrote:
>
> This should not be the case in the reported situation unless setting 
> osd_fast_fail_on_connection_refused<https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused>=true
>  changes this behaviour.


In our tests it does change the behavior. Usually the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In 
our tests, this is the case if an OSD heartbeat is dropped and the OSD is still 
able to talk to the mons.

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434


The mon then propagates that failure, without taking any other reports into 
consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption 
presumably being: If a host can answer with a rejection to the OSD heartbeat, 
it is only the OSD that is affected.

In our case however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safeguards it usually does.
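
To make the difference between the two paths concrete, here is a minimal 
sketch in Python. It is a simplified model of the behavior described above, 
not the actual Ceph code: the names FailureReport and should_mark_down are 
hypothetical, the grace period is left out entirely, and reporter_subtree 
stands for whatever bucket mon_osd_reporter_subtree_level selects (a host 
name in this example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    """One failure report as seen by the mon (hypothetical model)."""
    target_osd: int          # OSD that is reported as failed
    reporter_subtree: str    # subtree (e.g. host) of the reporting OSD
    immediate: bool = False  # True if the heartbeat got a connection refused


def should_mark_down(reports, target_osd, min_down_reporters=2):
    """Decide whether target_osd would be marked down in this model.

    min_down_reporters plays the role of mon_osd_min_down_reporters.
    """
    relevant = [r for r in reports if r.target_osd == target_osd]

    # Fast path: a single immediate report is enough, no other
    # reports are taken into consideration.
    if any(r.immediate for r in relevant):
        return True

    # Normal path: count distinct reporting subtrees, so several OSDs
    # on the same host count as one reporter.
    subtrees = {r.reporter_subtree for r in relevant}
    return len(subtrees) >= min_down_reporters


if __name__ == "__main__":
    dropped = [FailureReport(7, "host-a"), FailureReport(7, "host-a")]
    refused = [FailureReport(7, "host-a", immediate=True)]
    print(should_mark_down(dropped, 7))  # two reports, but only one subtree
    print(should_mark_down(refused, 7))  # one immediate report
```

In this model, two dropped-heartbeat reports from the same host are not 
enough to mark the OSD down, while a single rejected heartbeat is. That is 
the asymmetry we ran into when the gateway started rejecting connections.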
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
