This sounds like all your nodes are on a single switch, which is risky for 
production, for this reason and others.

If that’s the case, I suggest shutting down the cluster completely in advance, 
as described in the docs.
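
For reference, the documented shutdown sequence is roughly the following, 
run from a node with the client.admin keyring after quiescing clients:

    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause

After the switch work, power everything back on and unset the same flags in 
reverse order, finishing with 'ceph osd unset noout'.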

> On May 29, 2022, at 9:10 PM, Jeremy Hansen <[email protected]> wrote:
> 
> So in my experience so far: if I take out a switch for a firmware update 
> and a reboot, meaning all Ceph nodes lose network connectivity and 
> communication with each other, Ceph becomes unresponsive, and my only fix 
> up to this point has been to reboot the compute nodes one by one. Are you 
> saying I just need to wait? I don’t know how long I’ve waited in the past, 
> but if you’re saying at least 10 minutes, I probably haven’t waited that long.
> 
> Thanks
> -jeremy
> 
>> On Sunday, May 29, 2022 at 3:40 PM, Tyler Stachecki 
>> <[email protected]> wrote:
>> Ceph always aims to provide high availability. So, if you do not set cluster 
>> flags that prevent Ceph from trying to self-heal, it will self-heal.
>> 
>> Based on your description, it sounds like you want to consider the 'noout' 
>> flag. By default, after an OSD has been down for 10 minutes 
>> (mon_osd_down_out_interval = 600 seconds), Ceph will mark the affected OSD 
>> out and start recovering its data elsewhere to ensure high availability.
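>> 
>> For example (assuming a release new enough to have the centralized config 
>> database), something like:
>> 
>>     ceph config get mon mon_osd_down_out_interval   # default: 600 seconds
>>     ceph osd set noout                              # before the switch work
>>     ceph osd unset noout                            # after connectivity returns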
>> 
>> But be careful: for latency reasons you likely still want to pre-emptively 
>> mark the OSDs down ahead of the planned maintenance, so client I/O is not 
>> left waiting on unreachable OSDs, and you must be cognisant of whether your 
>> replication policy puts you in a position where an unrelated failure during 
>> the maintenance can result in inactive PGs.
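>> 
>> As a rough sketch (the OSD IDs here are placeholders for the ones behind 
>> the switch):
>> 
>>     ceph osd down 0 1 2           # pre-mark the affected OSDs down
>>     ceph pg dump_stuck inactive   # verify no PGs have gone inactive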
>> 
>> Cheers,
>> Tyler
>> 
>> 
>> On Sun, May 29, 2022, 5:30 PM Jeremy Hansen <[email protected]> wrote:
>>> Is there a maintenance mode for Ceph that would allow me to do work on 
>>> underlying network equipment without causing Ceph to panic? In our test 
>>> lab we don’t have redundant networking, and when doing switch upgrades and 
>>> such, Ceph has a panic attack and we end up having to reboot the Ceph 
>>> nodes anyway. Something like an HDFS-style read-only mode?
>>> 
>>> Thanks!
>>> 

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]