Hi Denis,

I have to ask a clarifying question. If I understand the intent of 
osd_fast_fail_on_connection_refused correctly, an OSD that receives a 
connection_refused should get marked down fast to avoid unnecessarily long wait 
times, and *only* OSDs that receive a connection refused.

In your case, did booting up the server actually create a network route to the 
wrong network for all other OSDs as well? In other words, did it act as a 
gateway such that all OSDs received connection refused messages, not just the 
ones on the critical host? If so, your observation would be expected. If not, 
then something is wrong with the down reporting that should be looked at.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <[email protected]>
Sent: Friday, November 24, 2023 1:20 PM
To: Denis Krienbühl; Burkhard Linke
Cc: [email protected]
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

Hi Denis.

>  The mon then propagates that failure, without taking any other reports into 
> consideration:

Exactly. I cannot imagine that this change of behavior is intended. The configs 
for OSD down reporting ought to be honored in any failure situation. Since you 
have already investigated the relevant code lines, please update/create the 
tracker with your findings. Hopefully a dev will look at this.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Denis Krienbühl <[email protected]>
Sent: Friday, November 24, 2023 12:04 PM
To: Burkhard Linke
Cc: [email protected]
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered


> On 24 Nov 2023, at 11:49, Burkhard Linke 
> <[email protected]> wrote:
>
> This should not be the case in the reported situation unless setting 
> osd_fast_fail_on_connection_refused=true 
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused) 
> changes this behaviour.


In our tests it does change the behavior. Usually the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account; we 
see this when an OSD heartbeat is dropped and the OSD is still able to talk to 
the mons.
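
For illustration, that guarded path amounts to something like the following 
(a simplified C++ sketch, not the actual Ceph code; the type and function 
names are invented):

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    struct FailureReport {
      int reporter_osd;
      std::string reporter_subtree;  // reporter's host/rack, per
                                     // mon_osd_reporter_subtree_level
    };

    // Mark the target down only once reports have arrived from enough
    // distinct subtrees (mon_osd_min_down_reporters).
    bool should_mark_down(const std::vector<FailureReport>& reports,
                          std::size_t min_down_reporters) {
      std::set<std::string> reporters_by_subtree;
      for (const auto& r : reports)
        reporters_by_subtree.insert(r.reporter_subtree);
      return reporters_by_subtree.size() >= min_down_reporters;
    }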

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434
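
In outline, that fast-fail path does something like this (a simplified C++ 
sketch, not the actual Ceph code; send_failure_to_mon stands in for the real 
MOSDFailure message):

    #include <cerrno>
    #include <cstdio>

    enum class FailureFlag { Timeout, Immediate };

    // Stand-in for queuing an MOSDFailure message to the monitor.
    void send_failure_to_mon(int peer_osd, FailureFlag flag) {
      std::printf("reporting osd.%d (%s)\n", peer_osd,
                  flag == FailureFlag::Immediate ? "immediate" : "timeout");
    }

    void on_heartbeat_error(int peer_osd, int err,
                            bool fast_fail_on_connection_refused) {
      if (err == ECONNREFUSED && fast_fail_on_connection_refused) {
        // Fast path: report the peer at once; the mon acts on this
        // single report immediately.
        send_failure_to_mon(peer_osd, FailureFlag::Immediate);
      } else {
        // Normal path: the failure only counts once the heartbeat grace
        // expires, and the mon still aggregates several reporters.
        send_failure_to_mon(peer_osd, FailureFlag::Timeout);
      }
    }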

The mon then propagates that failure, without taking any other reports into 
consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367
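
The effect on the mon side is essentially this (again a simplified sketch 
with invented helper names, not the actual Ceph code):

    #include <cstdio>

    // Invented stand-ins for the mon's bookkeeping.
    void mark_osd_down(int osd) { std::printf("osd.%d down\n", osd); }
    void record_report_and_check_thresholds(int osd) { /* guarded path */ }

    // An "immediate" failure report bypasses the reporter aggregation
    // entirely; only the normal path applies the reporter thresholds.
    void handle_failure_report(int target_osd, bool immediate) {
      if (immediate) {
        mark_osd_down(target_osd);  // no mon_osd_min_down_reporters check
      } else {
        record_report_and_check_thresholds(target_osd);
      }
    }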

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption is 
presumably: if a host can answer an OSD heartbeat with a rejection, only the 
OSD itself is affected.

In our case, however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safeguards it usually does.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]