Thanks Frank. I see it the same way. I’ll be sure to create a ticket with all the details and steps to reproduce the issue.
Denis

> On 24 Nov 2023, at 10:24, Frank Schilder <[email protected]> wrote:
>
> Hi Denis,
>
> I would agree with you that a single misconfigured host should not take out
> healthy hosts under any circumstances. I'm not sure if your incident is
> actually covered by the devs' comments; it is quite possible that you
> observed an unintended side effect that is a bug in the handling of the
> connection error. I think the intention is to quickly shut down the OSDs
> with connection refused (where timeouts are not required), not other OSDs.
>
> A bug report with a tracker seems warranted.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Denis Krienbühl <[email protected]>
> Sent: Friday, November 24, 2023 9:01 AM
> To: ceph-users
> Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered
>
> Hi
>
> We’ve recently had a serious outage at work, after a host had a network
> problem:
>
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it
>   to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the
>   mons to take all other OSDs down (immediate failure).
> - Only after shutting down the faulty host was it possible to restart the
>   stopped OSDs and restore the cluster.
>
> We have since solved the problem by removing the default route that caused
> the packets to end up in the wrong network, where they were summarily
> rejected by a firewall. That is, we made sure that packets would be dropped
> in the future, not rejected.
>
> Still, I figured I’d share this experience of ours on this mailing list, as
> this seems to be something others might encounter as well.
>
> In the following PR, which introduced osd_fast_fail_on_connection_refused,
> there’s this description:
>
>> This changeset adds additional handler (handle_refused()) to the dispatchers
>> and code that detects when connection attempt fails with ECONNREFUSED error
>> (connection refused) which is a clear indication that host is alive, but
>> daemon isn't, so daemons can instantly mark the other side as undoubtly
>> downed without the need for grace timer.
>
> And this comment:
>
>> As for flapping, we discussed it on ceph-devel ml
>> and came to conclusion that it requires either broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
>
> https://github.com/ceph/ceph/pull/8558
>
> It has left us wondering if these are the right assumptions. An ECONNREFUSED
> condition can bring down a whole cluster, and I wonder if there should be
> some kind of safeguard to ensure that this is avoided. One badly configured
> host should generally not be able to do that. If the packets are dropped
> instead of refused, the cluster notices that the OSD down reports come only
> from one host, and acts accordingly.
>
> What do you think? Does this warrant a change in Ceph? I’m happy to provide
> details and create a ticket.
>
> Cheers,
>
> Denis
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
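[Editor's note: the refused-vs-dropped distinction at the heart of this thread can be illustrated outside Ceph. The sketch below is not Ceph code; it is a minimal, hypothetical Python example showing why a connect() that gets ECONNREFUSED fails instantly (the host's kernel answers with RST, so the peer is provably up), while a dropped packet produces no answer and can only fail via a timeout, which is why Ceph's grace-timer logic still applies in that case.]

```python
# Minimal sketch (not Ceph code): classify a TCP connection failure as
# "refused" (immediate RST, host alive but daemon down) vs "timeout"
# (no response at all, e.g. packets dropped by a firewall or a dead host).
import socket
import time


def classify_connect(host, port, timeout=3.0):
    """Attempt a TCP connect and report how it failed and how long it took."""
    start = time.monotonic()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected", time.monotonic() - start
    except ConnectionRefusedError:
        # An RST came back: the host is up, the daemon is not. This is the
        # signal osd_fast_fail_on_connection_refused keys on -- no grace
        # timer is needed to conclude the peer daemon is down.
        return "refused", time.monotonic() - start
    except socket.timeout:
        # No response at all: a dropping firewall and a dead host look the
        # same here, so only timeout-based failure detection can decide.
        return "timeout", time.monotonic() - start
    finally:
        sock.close()


if __name__ == "__main__":
    # Port 1 on localhost is almost certainly closed, so the kernel replies
    # with RST and the failure is reported near-instantly.
    outcome, elapsed = classify_connect("127.0.0.1", 1)
    print(outcome, round(elapsed, 3))
```

The asymmetry this demonstrates is the crux of Denis's point: "refused" is a strong, instant signal about one peer, but as the outage shows, a misconfigured host can emit that signal about many healthy peers at once, which is where an additional safeguard would help.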
