Re: [PVE-User] Corosync and Cluster reboot

Iztok Gregori Wed, 08 Jan 2025 02:12:50 -0800

Hi!

On 07/01/25 15:15, DERUMIER, Alexandre wrote:

Personally, I'd recommend disabling HA temporarily during the network change
(mv /etc/pve/ha/resources.cfg to a tmp directory, stop all pve-ha-lrm,
then stop all pve-ha-crm to stop the watchdog)


Then, after the migration, check the corosync logs for 1 or 2 days, and
after that, if no retransmits occur, re-enable HA.
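
The log check described above could be scripted. What follows is only a minimal sketch, assuming corosync reports retransmits in lines containing "Retransmit List"; the function name and the sample log lines are made up for illustration and do not come from a real cluster.

```python
# Assumption: corosync logs totem retransmits as "[TOTEM ] Retransmit List: ...".
MARKER = "Retransmit List"


def count_retransmits(lines):
    """Return how many log lines mention a totem retransmit."""
    return sum(1 for line in lines if MARKER in line)


# Made-up sample lines, only to show the expected input shape.
sample = [
    "Jan 07 15:15:01 node1 corosync[1234]:   [TOTEM ] Retransmit List: 1a 1b",
    "Jan 07 15:15:02 node1 corosync[1234]:   [KNET  ] link: host: 2 link: 0 is up",
    "Jan 07 15:15:03 node1 corosync[1234]:   [TOTEM ] Retransmit List: 1c",
]

print(count_retransmits(sample))
```

To use it on real logs, one could pass sys.stdin to count_retransmits() and pipe in the output of journalctl -u corosync --since "2 days ago" on each node.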

Good advice. But with the pve-ha-* services down, the "HA-VMs" cannot migrate from one node to another, because the migration is handled by HA (or at least that is how I remember it happening some time ago). So I've (temporarily) removed all the resources (VMs) from HA, which has the effect of telling "pve-ha-lrm" to disable the watchdog ("watchdog closed (disabled)"), so no reboot should occur.

It's quite possible that it's a corosync bug (I remember having this kind
of error with pve 7.X)

I'm leaning towards a similar conclusion, but I still lack an understanding of how corosync/watchdog is handled in Proxmox.

For example, I still don't know who is updating the watchdog-mux service. Is it corosync (but no "watchdog_device" is set in corosync.conf, and according to the manual, "if unset, empty or "off", no watchdog is used") or is it pve-ha-lrm?

I think that, after the migration, my best shot is to upgrade the cluster, but I have to find out whether newer libcephfs client libraries support old Ceph clusters.

Also, for "big" clusters (20-30 nodes), I'm using the sctp protocol now instead
of udp. For me, it's a lot more reliable when you have network saturation on 1 node.

(I had a case of an internet udp flood attack coming from outside on 1 of my
nodes, lagging the whole corosync cluster.)


corosync.conf

totem {
    cluster_name: ....
    ....
   interface {
       knet_transport: sctp
       linknumber: 0
   }
   ....


(This needs a full restart of corosync everywhere, and HA needs to be disabled
beforehand, because udp can't communicate with sctp, so you'll have a loss of
quorum during the change)

I've read about it, and I think I'll follow your suggestion. In those bigger clusters, have you tinkered with corosync values such as "token" or "token_retransmits_before_loss_const"?
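
As background to the "token" question: according to corosync.conf(5), when a nodelist with at least 3 nodes is present, the effective token timeout is token + (number_of_nodes - 2) * token_coefficient, so it already grows with cluster size. Below is a rough sketch of that arithmetic. The defaults of 3000 ms for token and 650 ms for token_coefficient are assumptions and should be checked against the man page of the installed corosync version; the function name is hypothetical.

```python
def effective_token_ms(nodes, token_ms=3000, coefficient_ms=650):
    """Effective totem token timeout in milliseconds for a given node count.

    The default values are assumptions; verify them in corosync.conf(5).
    """
    if nodes < 3:
        # The coefficient only applies with at least 3 nodes in the nodelist.
        return token_ms
    return token_ms + (nodes - 2) * coefficient_ms


for n in (20, 30):
    print(n, effective_token_ms(n))
```

With those assumed defaults, a 20-node cluster ends up at 14700 ms and a 30-node cluster at 21200 ms, before any manual tuning of "token".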


Thank you!

Iztok


