Just to clarify: I had a similar issue on a low-latency network with a 12-node cluster, all nodes with 1G Ethernet cards. After adding this token_retransmit setting to corosync.conf, the problems went away. Perhaps it could help you too.
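For context on the "Token has not been received in 9562 ms" lines in the quoted log, here is a rough sketch of where that number may come from. It assumes stock Corosync 3.1 defaults (token 3000 ms, token_coefficient 650 ms, token_warning 75%) and the 17 nodes mentioned in the original post. These values are assumptions on my side, not taken from the actual cluster configuration, and the function names are only illustrative.

```python
# Sketch: estimate the runtime token timeout and the point at which
# corosync starts logging "Token has not been received in N ms".
# All constants below are ASSUMED defaults, not values read from the
# cluster in question.

ASSUMED_TOKEN_MS = 3000             # totem.token
ASSUMED_TOKEN_COEFFICIENT_MS = 650  # totem.token_coefficient
ASSUMED_TOKEN_WARNING_PERCENT = 75  # totem.token_warning
NODES_IN_NODELIST = 17              # from the original post


def effective_token_ms(token_ms: int, coefficient_ms: int, nodes: int) -> int:
    """Token timeout after scaling: the coefficient is added once for
    every node beyond the second (no scaling for 1 or 2 nodes)."""
    extra_nodes = max(nodes - 2, 0)
    return token_ms + extra_nodes * coefficient_ms


def warning_threshold_ms(token_ms: int, warning_percent: int) -> int:
    """Elapsed time (in ms, rounded down) after which the warning is logged."""
    return token_ms * warning_percent // 100


token = effective_token_ms(
    ASSUMED_TOKEN_MS, ASSUMED_TOKEN_COEFFICIENT_MS, NODES_IN_NODELIST
)
warning = warning_threshold_ms(token, ASSUMED_TOKEN_WARNING_PERCENT)

print(f"effective token timeout: {token} ms")   # 12750 ms
print(f"warning logged after:    {warning} ms")  # 9562 ms
```

If these assumptions hold, the 9562 ms in the log is simply the 75% warning mark of a 12750 ms token timeout, which would suggest the cluster is running with default totem timings. The values actually in effect should be checked on a node before changing anything.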

On Tue, Jan 7, 2025 at 09:01, Gilberto Ferreira <gilberto.nune...@gmail.com> wrote:

> Try adding this to corosync.conf on one of the nodes:
>
> token_retransmit: 200
>
> On Tue, Jan 7, 2025 at 08:24, Iztok Gregori <iztok.greg...@elettra.eu> wrote:
>
>> Hi to all!
>>
>> I need some help to understand a situation (cluster reboot) which
>> happened to us last week. We are running a 17-node Proxmox cluster
>> with a separate Ceph cluster for storage (no hyper-convergence).
>>
>> We have to upgrade a stack of 2 switches, and in order to avoid any
>> downtime we decided to prepare a new (temporary) stack and move the
>> links from one switch to the other. Our procedure was the following:
>>
>> - Migrate all the VMs off the node.
>> - Unplug the links from the old switch.
>> - Plug the links into the temporary switch.
>> - Wait till the node is available again in the cluster.
>> - Repeat.
>>
>> We had to move 8 nodes from one switch to the other. The first 4 nodes
>> went smoothly, but when we plugged the 5th node into the new switch ALL
>> the nodes which have HA VMs configured rebooted!
>>
>> From the Corosync logs I see that the token wasn't received and, because
>> of that, watchdog-mux wasn't updated, causing the node reboot.
>>
>> Here are the Corosync logs during the procedure and before the nodes
>> restarted. They were captured on a node which didn't reboot (pve-ha-lrm:
>> idle):
>>
>> > 12:51:57 [KNET ] link: host: 18 link: 0 is down
>> > 12:51:57 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
>> > 12:51:57 [KNET ] host: host: 18 has no active links
>> > 12:52:02 [TOTEM ] Token has not been received in 9562 ms
>> > 12:52:16 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
>> > 12:52:16 [QUORUM] Sync left[1]: 18
>> > 12:52:16 [TOTEM ] A new membership (1.d29) was formed. Members left: 18
>> > 12:52:16 [TOTEM ] Failed to receive the leave message. failed: 18
>> > 12:52:16 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
>> > 12:52:16 [MAIN ] Completed service synchronization, ready to provide service.
>> > 12:52:42 [KNET ] rx: host: 18 link: 0 is up
>> > 12:52:42 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
>> > 12:52:50 [TOTEM ] Token has not been received in 9567 ms
>> > 12:53:01 [TOTEM ] Token has not been received in 20324 ms
>> > 12:53:11 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
>> > 12:53:11 [TOTEM ] A new membership (1.d35) was formed. Members
>> > 12:53:20 [TOTEM ] Token has not been received in 9570 ms
>> > 12:53:31 [TOTEM ] Token has not been received in 20326 ms
>> > 12:53:41 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
>> > 12:53:41 [TOTEM ] A new membership (1.d41) was formed. Members
>> > 12:53:50 [TOTEM ] Token has not been received in 9570 ms
>>
>> And here you can find the logs of a successfully completed "procedure":
>>
>> > 12:19:12 [KNET ] link: host: 19 link: 0 is down
>> > 12:19:12 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
>> > 12:19:12 [KNET ] host: host: 19 has no active links
>> > 12:19:17 [TOTEM ] Token has not been received in 9562 ms
>> > 12:19:31 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
>> > 12:19:31 [QUORUM] Sync left[1]: 19
>> > 12:19:31 [TOTEM ] A new membership (1.d21) was formed. Members left: 19
>> > 12:19:31 [TOTEM ] Failed to receive the leave message. failed: 19
>> > 12:19:31 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
>> > 12:19:31 [MAIN ] Completed service synchronization, ready to provide service.
>> > 12:19:47 [KNET ] rx: host: 19 link: 0 is up
>> > 12:19:47 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
>> > 12:19:50 [QUORUM] Sync members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
>> > 12:19:50 [QUORUM] Sync joined[1]: 19
>> > 12:19:50 [TOTEM ] A new membership (1.d25) was formed. Members joined: 19
>> > 12:19:51 [QUORUM] Members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
>> > 12:19:51 [MAIN ] Completed service synchronization, ready to provide service.
>>
>> Comparing the 2 logs I can see that, after the "host: 18" link was found
>> active again, the token was not received, but I cannot figure out what
>> went differently in this case.
>>
>> I have 2 possible culprits:
>>
>> 1. NETWORK
>>
>> The cluster network is backed by 5 Extreme Networks switch stacks: 3
>> stacks of two x870 (100GBE), 1 stack of two x770 (40GBE) and one
>> temporary stack of two 7720-32C (100GBE). The switches are linked
>> together by a 2x LACP bond, and 99% of the cluster communication is
>> on 100GBE.
>>
>> The hosts are connected to the network with interfaces of different
>> speeds: 10GBE (1 node), 25GBE (4 nodes), 40GBE (1 node), 100GBE (11
>> nodes). All the nodes use bonded links; the Corosync network (the same
>> as the management one) is defined on a bridge interface on top of the
>> bond (the configuration is almost the same on all nodes; some older
>> ones use balance-xor, the others use LACP as bonding mode).
>>
>> It is possible that there is something wrong with the network, but I
>> cannot find a probable cause. From the data that I have, I don't see
>> anything special: no links were saturated, no errors logged...
>>
>> 2. COROSYNC
>>
>> The cluster is running an OLD version of Proxmox (7.1-12) with Corosync
>> 3.1.5-pve2. It is possible that there is a problem in Corosync that was
>> fixed in a later release. I did a quick search but I didn't find
>> anything. The cluster upgrade is on my to-do list (but the list is
>> huge, so it will not be done tomorrow).
>>
>> We are running only one Corosync network, which is the same as the
>> management/migration one, but different from the one for
>> client/storage/backup. The configuration is very basic, I think it is
>> the default one; I can provide it if needed.
>>
>> I checked the Corosync stats and the average latency is around 150
>> (microseconds?) across all links on all nodes.
>>
>> ====
>>
>> In general it can be a combination of the 2 above or something
>> completely different.
>>
>> Do you have some advice on where to look to debug further?
>>
>> I can provide more information if needed.
>>
>> Thanks a lot!
>>
>> Iztok
>>
>> --
>> Iztok Gregori
>> ICT Systems and Services
>> Elettra - Sincrotrone Trieste S.C.p.A.
>> Telephone: +39 040 3758948
>> http://www.elettra.eu
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user