Hi!

On 07/01/25 15:15, DERUMIER, Alexandre wrote:
Personally, I'd recommend disabling HA temporarily during the network change
(move /etc/pve/ha/resources.cfg to a temporary directory, stop pve-ha-lrm on
all nodes, then stop pve-ha-crm on all nodes to stop the watchdog).

Then, after the migration, check the corosync logs for 1 or 2 days, and
after that, if no retransmits occur, re-enable HA.
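
A minimal sketch of how that log check could be scripted, assuming the corosync log is available as text lines (for example, exported from the journal) and that retransmits are reported with the wording "Retransmit List". The function name and the sample lines are hypothetical and only illustrate the idea.

```python
from typing import Iterable

# Text that corosync's totem layer prints when it has to resend messages.
# Assumption: the corosync version in use logs retransmits with this wording.
RETRANSMIT_MARKER = "Retransmit List"


def count_retransmits(log_lines: Iterable[str]) -> int:
    """Return how many log lines report a totem retransmit."""
    return sum(1 for line in log_lines if RETRANSMIT_MARKER in line)


# Illustrative sample lines, not taken from a real cluster.
sample = [
    "corosync[1234]:   [TOTEM ] Retransmit List: 1a 1b",
    "corosync[1234]:   [KNET  ] link: host: 2 link: 0 is up",
    "corosync[1234]:   [TOTEM ] Retransmit List: 1c",
]
print(count_retransmits(sample))
```

A count of zero over the observation window would be the signal that HA can be re-enabled.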


Good advice. But with the pve-ha-* services down, the "HA VMs" cannot migrate from one node to another, because migration is handled by HA (or at least that is how I remember it working some time ago). So I have (temporarily) removed all the resources (VMs) from HA, which tells "pve-ha-lrm" to disable the watchdog ("watchdog closed (disabled)"), so no reboot should occur.

It's quite possible that it's a corosync bug (I remember having this kind
of error with PVE 7.x).

I'm leaning toward a similar conclusion, but I still lack an understanding of how corosync/watchdog is handled in Proxmox.

For example, I still don't know who is updating the watchdog-mux service. Is it corosync (but no "watchdog_device" is set in corosync.conf, and according to the manual, "if unset, empty or "off", no watchdog is used") or is it pve-ha-lrm?

I think that, after the migration, my best shot is to upgrade the cluster, but I first have to find out whether newer libcephfs client libraries support old Ceph clusters.

Also, for "big" clusters (20-30 nodes), I'm now using the sctp protocol instead
of udp. For me, it's a lot more reliable when the network is saturated on one node.

(I had a case of an internet UDP flood attack coming from outside on one of my
nodes, lagging the whole corosync cluster.)


corosync.conf:

totem {
    cluster_name: ....
    ....
    interface {
        knet_transport: sctp
        linknumber: 0
    }
    ....
}


(This needs a full restart of corosync on every node, and HA needs to be
disabled beforehand, because udp can't communicate with sctp, so you'll lose
quorum during the change.)
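
Since nodes on different transports cannot talk to each other, it may help to confirm that every node's corosync.conf carries the same transport per link before restarting. The following is a rough sketch, not a full corosync.conf parser: the function name and the sample config are hypothetical, a missing knet_transport is treated as udp (the documented default), and a missing linknumber is assumed to mean link 0.

```python
import re
from typing import Dict


def knet_transports(conf_text: str) -> Dict[int, str]:
    """Map each interface's linknumber to its knet_transport."""
    result: Dict[int, str] = {}
    # Interface blocks contain only "key: value" lines (no nested braces),
    # so a non-greedy match up to the next "}" captures one block.
    for block in re.findall(r"interface\s*\{(.*?)\}", conf_text, flags=re.S):
        settings = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                settings[key.strip()] = value.strip()
        # Assumption: an interface without linknumber is link 0.
        link = int(settings.get("linknumber", 0))
        # knet_transport falls back to udp when it is not set.
        result[link] = settings.get("knet_transport", "udp")
    return result


# Hypothetical config used only to exercise the function.
sample_conf = """
totem {
    cluster_name: demo
    interface {
        knet_transport: sctp
        linknumber: 0
    }
    interface {
        linknumber: 1
    }
}
"""
print(knet_transports(sample_conf))
```

Comparing the resulting dictionaries from all nodes would show whether any node was left on a different transport.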

I've read about it; I think I'll follow your suggestion. In those bigger clusters, have you tinkered with corosync values such as "token" or "token_retransmits_before_loss_const"?
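
For context on why "token" matters at that size: as far as I understand the corosync.conf(5) man page, when the nodelist has at least 3 nodes the effective token timeout is token + (number_of_nodes - 2) * token_coefficient. A small sketch of that arithmetic follows; the function name is hypothetical, the 650 ms coefficient is the documented default as I recall it, and the token value in the usage line is just an example input.

```python
def effective_token_timeout(token_ms: int, nodes: int,
                            token_coefficient_ms: int = 650) -> int:
    """Effective token timeout in milliseconds for a given cluster size."""
    # The coefficient only applies when the nodelist has at least 3 nodes.
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# Example: a 20-node cluster with token set to 3000 ms.
print(effective_token_timeout(3000, 20))
```

So with many nodes the effective timeout is already well above the configured "token" value, before any manual tuning.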

Thank you!

Iztok


_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
