On 7/25/25 4:03 PM, Friedrich Weber wrote:
+Corosync Over Bonds
+~~~~~~~~~~~~~~~~~~~
+
+Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
+in certain failure scenarios. If one of the bonded interfaces fails and stops
+transmitting packets, but its link state stays up, and there are no other
+Corosync links available
I thought this can also occur if there are still other Corosync links available?
If I understand the next part correctly, you're even assuming that?
, some bond modes may cause a state of asymmetric
+connectivity where cluster nodes can only communicate with different subsets of
+other nodes. Affected are bond modes that provide load balancing, as these
+modes may still try to send out a subset of packets via the failed interface.
+In case of asymmetric connectivity, Corosync may not be able to form a stable
+quorum in the cluster.
--- here
If this state persists and HA is enabled, nodes may
+fence themselves, even if their respective bond is still fully functioning
---
. In
+the worst case, the whole cluster may fence itself.
+
+We recommend at least one dedicated physical NIC for the primary Corosync link,
+see xref:pvecm_cluster_requirements[Requirements]. Bonds may be used as
+additional links for increased redundancy. To avoid fencing in the failure
+scenario outlined above, the following caveats apply whenever a bond is used
+for Corosync traffic:
+
+* We *advise against* using bond modes *balance-rr*, *balance-xor*,
+  *balance-tlb*, or *balance-alb* for Corosync traffic. As explained above,
+  they can cause asymmetric connectivity in certain failure scenarios.
+
+* *IEEE 802.3ad (LACP)*: This bond mode can cause asymmetric connectivity in
+  certain failure scenarios as explained above, but it can recover from this
+  state, as each side of the bond (Proxmox VE node and switch) can stop using a
+  bonded interface if it has not received three LACPDUs in a row on it.
+  However, with default settings, LACPDUs are only sent every 30 seconds,
+  yielding a failover time of 90 seconds. This is too long, as nodes with HA
+  resources will fence themselves already after roughly one minute without a
+  stable quorum. If LACP bonds are used for Corosync traffic, we recommend
+  setting `bond-lacp-rate fast` *on the Proxmox VE node and the switch*!
+  Setting this option on one side requests the other side to send an LACPDU
+  every second. Setting this option on both sides can reduce the failover time
+  in the scenario above to 3 seconds and thus prevent fencing.
+
+* Bond mode *active-backup* will not cause asymmetric connectivity in the
+  failure scenario described above. The node whose bond experienced the failure
+  may lose connection to the cluster and, if HA is enabled, fence itself.
+
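To make the timing argument in the LACP bullet easier to check, here is a
minimal sketch of the arithmetic. It assumes, as the text states, that a side
stops using an interface after three missed LACPDUs. The 60-second fencing
threshold is only my approximation of "roughly one minute", and the function
and constant names are made up for illustration.

```python
# Sketch of the failover-time arithmetic from the LACP bullet above.
# Assumption: a side stops using an interface after 3 missed LACPDUs.
# Assumption: "roughly one minute" is approximated as 60 seconds.
MISSED_LACPDUS = 3
FENCE_TIMEOUT_S = 60

# LACPDU send interval in seconds for each lacp-rate setting.
LACPDU_INTERVAL_S = {"slow": 30, "fast": 1}


def failover_time(rate: str) -> int:
    """Seconds until a silently failed bonded interface is no longer used."""
    return MISSED_LACPDUS * LACPDU_INTERVAL_S[rate]


def recovers_before_fencing(rate: str) -> bool:
    """True if the bond recovers before HA nodes would fence themselves."""
    return failover_time(rate) < FENCE_TIMEOUT_S


for rate in LACPDU_INTERVAL_S:
    print(rate, failover_time(rate), recovers_before_fencing(rate))
```

With the default (slow) rate, the bond needs 90 seconds to recover, which is
longer than the assumed fencing threshold. With `bond-lacp-rate fast` on both
sides, it needs 3 seconds.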
  Separate Cluster Network
  ~~~~~~~~~~~~~~~~~~~~~~~~
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel