Hi Folks,

I have a 13-node cluster on v5.4.13 that has been running fine in production, with separate networks for storage, data, heartbeat, and management. All interfaces are Linux bonds connected to LACP port channels on Cisco 3750G switches, all Gigabit. Last evening I started maintenance to update to the latest 5.4 release in preparation for upgrading to 6.2.

The first 3 nodes went OK, as expected, with no issues. When I started to migrate some VMs back to a node I had just upgraded, the whole cluster crashed and all 13 nodes rebooted. After recovery, the network bonds on two nodes were blocking and several VMs were locked (migration). I was able to recover the cluster and all VMs OK; overall the system was pretty resilient and it didn't take too long to get everything restored.

This morning I started diagnosing what had happened and found a common log message on all nodes, at roughly the same time as the all-node reboot:

Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 34a 353 35b 35c 35d 35e 35f 360 364 365 366 368 369 36a 36b 36c
.........

I did some reading and I understand that some sort of latency was introduced on the heartbeat network during the live migration. But since my networks are separate, is the VM memory transfer between nodes performed on the heartbeat network? Can I specify which network to use for migration, such as storage (jumbo frames enabled) or management, to relieve any congestion on the heartbeat network segment?

Another question is about tuning: should I try to tune corosync settings such as '<totem netmtu="1480"/>' or '<totem window_size="170"/>', or just push through the upgrade to 6.2?

Any suggestions are welcome.

Thanks.

JR
--
JR Richardson
Engineering for the Masses
Chasing the Azeotrope
