On 07.05.25 17:22, Kevin Schneider wrote:
IMO this isn't strict enough and we should emphasize the importance of the problem. I would go for:

To ensure reliable Corosync redundancy, it is essential to use at least two separate physical and logical networks. A single bonded interface does not provide Corosync redundancy on its own. If a bonded interface fails and there is no second Corosync link, this can lead to asymmetric communication, causing all nodes to lose quorum, even if more than half of them can still communicate with each other.


Although a bond on the interface together with MLAG'd switches CAN provide further resiliency in case of switch or single NIC PHY failure, it does not protect against total failure of the NIC, of course.


I think adding a "typical topologies" or "example topologies" section to the docs might be a good idea?


Below is my personal, opinionated recommendation after deploying quite a number of Proxmox clusters. Of course I don't expect everyone to agree with this... but hopefully it can serve as a starting point?


Typical topologies:

In most cases, a server for a Proxmox cluster will have at least two physical NICs. One is usually a low or medium speed dual-port onboard NIC (1GBase-T or 10GBase-T). The other one is typically a medium or high speed add-in PCIe NIC (e.g. 10G SFP+, 40G QSFP+, 25G SFP28, 100G QSFP28). There may be more NICs depending on the specific use case, e.g. a separate NIC for Ceph Cluster (private, replication, back-side) traffic.

In such a setup, it is recommended to reserve the low or medium speed onboard NICs for cluster traffic (and potentially management purposes). These NICs should be connected using a switch. Although for very small clusters (3 nodes) and a dual-port NIC a ring topology could be used to connect the nodes together, this is not recommended as it makes later expansion more troublesome.

It is recommended to use a physically separate switch just for the cluster network. If your main switch is the only way for nodes to communicate, failure of this switch will take out your entire cluster with potentially catastrophic consequences.

For single-port onboard NICs there are no further design decisions to make. However, onboard NICs are almost always dual port, which allows some more freedom in the design of the cluster network.

Design of the dedicated cluster network:

a) Two separate cluster switches, switches support MLAG or Stacking / Virtual Chassis

This is the ideal scenario: you deploy two managed switches in an MLAG or Stacking / Virtual Chassis configuration. This requires the switches to have a link between them, called IPL ("Inter Peer Link"). MLAG or Stacking / Virtual Chassis makes the two switches behave as if they were one, but if one switch fails, the remaining one will still work and take over seamlessly!

Each cluster node is connected to both switches. Both NIC ports on each node are bonded together (LACP recommended).

This topology provides a very good degree of resiliency.

The bond is configured as Ring0 for corosync.
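
For reference, a minimal /etc/network/interfaces sketch of such a bond (interface names, addresses and the exact options are placeholder assumptions, adjust to your hardware):

    auto eno1
    iface eno1 inet manual

    auto eno2
    iface eno2 inet manual

    # LACP bond over both onboard ports, dedicated to corosync / cluster traffic
    auto bond0
    iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

The switch side needs a matching LACP port-channel / MLAG interface spanning both switches, of course.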


b) Two separate cluster switches, switches DO NOT support MLAG or Stacking / Virtual Chassis

In this scenario you deploy two separate switches (potentially unmanaged). There should not be a link between the switches, as this can easily lead to loops and makes the entire configuration more complex.

Each cluster node is connected to both switches, but the NIC ports are not bonded together. Typically, both NIC ports will be in separate IP subnets.

This topology provides a slightly lower degree of resiliency than option a).

One switch / broadcast domain is configured as Ring0 for corosync, the other one is configured as Ring1.
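
As a sketch, the nodelist in corosync.conf would then look roughly like this (node names and addresses are placeholders; with PVE you would normally not edit this by hand but pass both addresses via the --link0 / --link1 options of pvecm create / pvecm add):

    nodelist {
      node {
        name: pve1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.10.1
        ring1_addr: 10.20.20.1
      }
      node {
        name: pve2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.10.10.2
        ring1_addr: 10.20.20.2
      }
    }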


c) Single separate cluster switch

If you only want to deploy a single switch that is reserved for cluster traffic, you can either use a single NIC port on each node, or both bonded together. It will not make much of a difference, as bonding will only protect against single PHY / port failure.

The interface is configured as Ring0 for corosync.
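
If you do bond the two ports, a minimal sketch could look like this (placeholders again; active-backup has the advantage of working even with an unmanaged switch that does not speak LACP):

    # active-backup bond to the single cluster switch
    auto bond0
    iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1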


Usage of the other NICs for redundancy purposes:
It is recommended to add the other NICs / networks in the system as backup links / additional rings to corosync. Bad connectivity over a potentially congested storage network is better than no connectivity at all in case the dedicated cluster network has failed and there is no other backup.
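
With knet (corosync 3.x) you can also give the links explicit priorities, so that corosync prefers the dedicated cluster network and only falls back to the other networks when it is actually down. A sketch for the totem section of corosync.conf (priority values are arbitrary placeholders; if I remember the knet semantics correctly, with the default passive link mode the highest-priority link that is up carries the traffic):

    totem {
      ...
      interface {
        # dedicated cluster network, preferred
        linknumber: 0
        knet_link_priority: 20
      }
      interface {
        # e.g. the storage network, backup only
        linknumber: 1
        knet_link_priority: 10
      }
    }

If I recall correctly, the priorities can also be set at cluster creation / join time via pvecm (e.g. --link0 <addr>,priority=20).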


