Thanks everyone for your feedback. I created a ticket and added most of our internal post-mortem research:
https://tracker.ceph.com/issues/63636

Cheers,

Denis

> On 24 Nov 2023, at 09:01, Denis Krienbühl <[email protected]> wrote:
>
> Hi
>
> We’ve recently had a serious outage at work, after a host had a network
> problem:
>
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it
>   to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the
>   mons to take all other OSDs down (immediate failure).
> - Only after shutting down the faulty host was it possible to restart the
>   downed OSDs and restore the cluster.
>
> We have since solved the problem by removing the default route that caused
> the packets to end up in the wrong network, where they were summarily
> rejected by a firewall. That is, we made sure that packets would be dropped
> in the future, not rejected.
>
> Still, I figured I’d share this experience of ours on this mailing list, as
> this seems to be something others might encounter as well.
>
> The following PR, which introduced osd_fast_fail_on_connection_refused,
> has this description:
>
>> This changeset adds an additional handler (handle_refused()) to the
>> dispatchers and code that detects when a connection attempt fails with an
>> ECONNREFUSED error (connection refused), which is a clear indication that
>> the host is alive but the daemon isn't, so daemons can instantly mark the
>> other side as undoubtedly down without the need for the grace timer.
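As an aside, the distinction that changeset relies on is easy to reproduce outside Ceph. The following is a minimal Python sketch, not Ceph code; port 59999 is an arbitrary port assumed to have no listener:

```python
import socket
import time

def probe(host, port, timeout=2.0):
    """Attempt a TCP connect and classify the failure mode.

    ECONNREFUSED (the peer answers with RST) indicates the host is up
    but the daemon is not -- the signal osd_fast_fail_on_connection_refused
    acts on. A timeout (packets silently dropped) is ambiguous: the host,
    link, or network may be at fault, so a grace period is needed.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    start = time.monotonic()
    try:
        s.connect((host, port))
        return "open", time.monotonic() - start
    except ConnectionRefusedError:
        return "refused", time.monotonic() - start
    except socket.timeout:
        return "dropped-or-filtered", time.monotonic() - start
    finally:
        s.close()

# A closed port on localhost answers with RST almost instantly,
# while a firewall DROP rule would make the same probe hang until
# the timeout expires.
status, elapsed = probe("127.0.0.1", 59999)
print(status, f"{elapsed:.3f}s")
```

The point of the sketch: "refused" comes back in milliseconds, which is why treating it as proof of a dead daemon is tempting, and why a host that misroutes packets into a REJECTing firewall can generate that "proof" about every peer at once.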
> And this comment:
>
>> As for flapping, we discussed it on the ceph-devel ml and came to the
>> conclusion that it requires either a broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
>
> https://github.com/ceph/ceph/pull/8558
>
> It has left us wondering if these are the right assumptions. An ECONNREFUSED
> condition can bring down a whole cluster, and I wonder if there should be
> some kind of safeguard to ensure that this is avoided. One badly configured
> host should generally not be able to do that; if the packets are dropped
> instead of refused, the cluster notices that the OSD down reports come from
> only one host, and acts accordingly.
>
> What do you think? Does this warrant a change in Ceph? I’m happy to provide
> details and create a ticket.
>
> Cheers,
>
> Denis
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
