Thanks everyone for your feedback. I created a ticket and added most of our internal post-mortem research:
https://tracker.ceph.com/issues/63636

Cheers,

Denis

> On 24 Nov 2023, at 09:01, Denis Krienbühl <[email protected]> wrote:
>
> Hi
>
> We’ve recently had a serious outage at work, after a host had a network
> problem:
>
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it
>   to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the
>   mons to take all other OSDs down (immediate failure).
> - Only after shutting down the faulty host was it possible to restart the
>   downed OSDs and restore the cluster.
>
> We have since solved the problem by removing the default route that caused
> the packets to end up in the wrong network, where they were summarily
> rejected by a firewall. That is, we made sure that packets would be dropped
> in the future, not rejected.
>
> Still, I figured I’d share this experience of ours on this mailing list, as
> this seems to be something others might encounter as well.
>
> The following PR, which introduced osd_fast_fail_on_connection_refused,
> has this description:
>
>> This changeset adds an additional handler (handle_refused()) to the
>> dispatchers and code that detects when a connection attempt fails with an
>> ECONNREFUSED error (connection refused), which is a clear indication that
>> the host is alive but the daemon isn't, so daemons can instantly mark the
>> other side as undoubtedly down without the need for the grace timer.
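As an aside, the distinction that changeset relies on is easy to reproduce outside Ceph. The following is a minimal Python sketch, not Ceph code; port 59999 is an arbitrary port assumed to have no listener:

```python
import socket
import time

def probe(host, port, timeout=2.0):
    """Attempt a TCP connect and classify the failure mode.

    ECONNREFUSED (the peer answers with RST) indicates the host is up
    but the daemon is not -- the signal osd_fast_fail_on_connection_refused
    acts on. A timeout (packets silently dropped) is ambiguous: the host,
    link, or network may be at fault, so a grace period is needed.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    start = time.monotonic()
    try:
        s.connect((host, port))
        return "open", time.monotonic() - start
    except ConnectionRefusedError:
        return "refused", time.monotonic() - start
    except socket.timeout:
        return "dropped-or-filtered", time.monotonic() - start
    finally:
        s.close()

# A closed port on localhost answers with RST almost instantly,
# while a firewall DROP rule would make the same probe hang until
# the timeout expires.
status, elapsed = probe("127.0.0.1", 59999)
print(status, f"{elapsed:.3f}s")
```

The point of the sketch: "refused" comes back in milliseconds, which is why treating it as proof of a dead daemon is tempting, and why a host that misroutes packets into a REJECTing firewall can generate that "proof" about every peer at once.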
> And this comment:
>
>> As for flapping, we discussed it on the ceph-devel ml and came to the
>> conclusion that it requires either a broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
>
> https://github.com/ceph/ceph/pull/8558
>
> It has left us wondering if these are the right assumptions. An ECONNREFUSED
> condition can bring down a whole cluster, and I wonder if there should be
> some kind of safeguard to ensure that this is avoided. One badly configured
> host should generally not be able to do that; if the packets are dropped
> instead of refused, the cluster notices that the OSD down reports come from
> only one host, and acts accordingly.
>
> What do you think? Does this warrant a change in Ceph? I’m happy to provide
> details and create a ticket.
>
> Cheers,
>
> Denis
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
