Thanks for taking a look! Discussed with HD off-list:
- Having the justification for the recommendations in the docs is good. But since the justification is somewhat complex, it is probably better not to put it directly at the beginning of the new section.
- It might be better to state the recommendations first ("We recommend [...]" plus the list of bond modes) and the justification below that, so readers immediately see the important part and can optionally still read the justification.

Hence I'll send a v3 that rearranges the paragraphs.

On 28/07/2025 18:16, Hannes Duerr wrote:
>
> On 7/25/25 4:03 PM, Friedrich Weber wrote:
>> +Corosync Over Bonds
>> +~~~~~~~~~~~~~~~~~~~
>> +
>> +Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
>> +in certain failure scenarios. If one of the bonded interfaces fails and stops
>> +transmitting packets, but its link state stays up, and there are no other
>> +Corosync links available
> I thought it can also occur if there are still other Corosync links available?

In my tests so far, it didn't. Even if the bond is the primary Corosync link, corosync seems to simply switch over to a fallback link, as long as one is still available.

Here, 172.16.0.0/24 is the LACP-bonded network, and I stopped traffic on one bonded NIC of node 2. corosync just logs:

On node 1:

Jul 29 11:31:39 pve1 corosync[841]: [KNET ] link: host: 2 link: 0 is down
Jul 29 11:31:39 pve1 corosync[841]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul 29 11:31:39 pve1 corosync[841]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)

On node 2:

Jul 29 11:31:39 pve2 corosync[837]: [KNET ] link: host: 4 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] link: host: 1 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)

And nothing on the other nodes.
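For context, a two-link test cluster like the one above could be configured with a corosync.conf along the following lines. This is an illustrative sketch based on Proxmox VE defaults, not the actual test configuration; the cluster name and the `secauth`/`link_mode` values are assumptions, and only two of the four nodes are shown:

```
totem {
  cluster_name: testcluster   # assumed name, not from the test setup
  config_version: 1
  interface {
    linknumber: 0             # 172.16.0.0/24, the LACP-bonded network
  }
  interface {
    linknumber: 1             # 192.168.0.0/24, the fallback network
  }
  ip_version: ipv4-6
  link_mode: passive          # knet uses one "best" link at a time
  secauth: on
  version: 2
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.101  # Corosync link 0
    ring1_addr: 192.168.0.101 # Corosync link 1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.0.102
    ring1_addr: 192.168.0.102
  }
  # nodes 3 and 4 analogous
}
```

With `link_mode: passive`, knet sends traffic over one link at a time and switches to another configured link when the current one is detected as down, which is what the "(passive) best link: 1" log lines above show.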
corosync-cfgtool reports the following on the four nodes. Note the entries missing the "connected" flag: link 0 from node 2 to nodes 1 and 4, matching the "link: 0 is down" log lines above.

Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.102) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.103) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.104) enabled connected mtu: 1397

Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.101) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.101) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.103) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.104) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.104) enabled connected mtu: 1397

Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.101) enabled connected mtu: 1397
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.102) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.102) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.104) enabled connected mtu: 1397

Local node ID 4, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.101) enabled connected mtu: 1397
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.102) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.103) enabled connected mtu: 1397

With `bond-lacp-rate slow`, this switches over to "connected" for all four interfaces after ~90 seconds.

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel