[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

Denis Krienbühl Fri, 24 Nov 2023 05:40:40 -0800

Hi Frank.

> On 24 Nov 2023, at 14:27, Frank Schilder <[email protected]> wrote:
> 
> I have to ask a clarifying question. If I understand the intent of 
> osd_fast_fail_on_connection_refused correctly, an OSD that receives a 
> connection_refused should get marked down fast to avoid unnecessarily long 
> wait times. And *only* OSDs that receive connection refused.
> 
> In your case, did booting up the server actually create a network route for 
> all other OSDs to the wrong network as well? In other words, did it act as a 
> gateway and all OSDs received connection refused messages and not just the 
> ones on the critical host? If so, your observation would be expected. If not, 
> then there is something wrong with the down reporting that should be looked 
> at.


No. The server has two networks through which it reaches OSDs and mons, call 
them north and south. South was down, and the traffic destined for it went 
through the default gateway to an unrelated host that bounced everything with 
“connection refused”.
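
To illustrate the distinction that matters here (an active "connection 
refused" versus packets that are silently dropped), below is a minimal Python 
sketch. It is not Ceph code; the helper name `probe` and the timeout value are 
made up for illustration. It only shows that a peer answering with a TCP reset 
produces an immediate ECONNREFUSED, while a dropped SYN merely runs into a 
timeout.

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Classify one TCP connect attempt (hypothetical helper, not part of Ceph).

    Returns "connected", "refused", "timeout", or "error:<ERRNO NAME>".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        # The peer (or whatever host the route led to) answered with a reset.
        return "refused"
    except socket.timeout:
        # Nothing came back in time, e.g. because the packets were dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, such as EHOSTUNREACH or ENETUNREACH.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
```

Calling `probe("127.0.0.1", port)` for a local port without a listener returns 
"refused" almost instantly, which is the kind of signal a misrouted host would 
have seen for every peer it tried to reach via south.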

North was still up, and through it the other OSDs and mons could also be 
reached.

So the host that was booted had the wrong configuration.

Traffic on the other hosts of the cluster was unaffected, and their network 
configuration remained as it was, though they could no longer reach the OSDs 
on the booted host via south. As I understand it, those packets would have 
been dropped.

I’ll create a detailed ticket and post it to this thread. I’m not sure I’ll 
be able to do it today, but after what I’ve heard I think this should at 
least be looked at in detail, and I’ll provide as much info as I can.

Denis
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
