I'd suggest a direct link between the hosts for another quorum ring if you have a spare network port. Multiple rings could also be more resilient than MLAG. But that's only my 2 cents.

see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
and: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds
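
For a two-node cluster like this, it roughly comes down to giving every node a second ring address in /etc/pve/corosync.conf, along the lines of the sketch below (pdpve1 is taken from the logs further down; pdpve2 and all addresses are just placeholders; remember to bump config_version and follow the edit procedure in the docs above):

  nodelist {
    node {
      # existing address on the bonded LAN
      name: pdpve1
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 192.0.2.11
      # second ring on the back-to-back cable
      ring1_addr: 10.99.99.1
    }
    node {
      name: pdpve2
      nodeid: 2
      quorum_votes: 1
      ring0_addr: 192.0.2.12
      ring1_addr: 10.99.99.2
    }
  }

With kronosnet (corosync 3, as used since PVE 6) both links are used and failed over automatically; optionally a knet_link_priority can be set per interface in the totem section to prefer the direct cable.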

On 8/19/2025 5:08 PM, Marco Gaiarin wrote:
We have pairs of servers in some local branches of our organization, in a
cluster but deliberately not in failover (or 'automatic failover'); this is
intended.

Most of these branch offices close for the summer holidays, when power outages
flourish. ;-)
Rather frequently the whole site gets powered off: the UPSes do their job, but
sooner or later they shut down the servers (and all the other equipment) until
some local employee goes to the site and powers everything back up.

The servers are backed by two UPSes (one per server); the UPSes also power a
stack of two Catalyst 2960S switches (again, one UPS per switch); every
interface on the servers is part of a trunk/bond, with one cable going to
switch1 and one cable to switch2 of the stack.
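
For illustration, a bond of that kind typically looks something like the following in /etc/network/interfaces on a PVE node (the NIC names, addresses and the bond mode here are placeholders only, not the actual config):

  auto bond0
  iface bond0 inet manual
      bond-slaves eno1 eno2
      bond-miimon 100
      bond-mode active-backup
      # eno1 cabled to switch1, eno2 to switch2 of the stack

  auto vmbr0
  iface vmbr0 inet static
      address 192.0.2.11/24
      gateway 192.0.2.1
      bridge-ports bond0
      bridge-stp off
      bridge-fd 0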


We recently upgraded to PVE 8 and found that when the whole site gets powered
off, sometimes, but with a fair frequency, only some of the VMs get powered
back on.


Digging for the culprit, we found:

  2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
  2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: <root@pam> starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
  2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
  2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
  2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
  2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: <root@pam> starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
  2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
  2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.

so the servers restart, get quorum, and start the VMs in order; but then quorum is suddenly lost:

  2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
  2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
  2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
  2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: <root@pam> end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
  2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.

and the remaining VMs do not run; after a few seconds quorum comes back and
everything returns to normal, but those VMs have to be started by hand.


Clearly, if we reboot or power off the two servers while the switches stay
powered on, everything works as expected.
We managed to power on the servers and reboot the switches at the same time,
and the trouble gets triggered.


So it seems that quorum gets lost because the switches stop forwarding traffic
for a while as they do their thing (eg, joining the second unit to the stack
and bringing up the ethernet bonds); that confuses the quorum, and bang.
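
As a side note, the corosync link and quorum state during that window can be watched with the usual tooling, something like the following (assuming a standard PVE 8 / corosync 3 setup):

  # knet link/ring status as corosync sees it on this node
  corosync-cfgtool -s
  # current quorum and vote state
  pvecm status
  # corosync and pmxcfs messages around the boot window
  journalctl -u corosync -u pve-cluster --since today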

We have tried to set:

        pvenode config set --startall-onboot-delay 120

on the two nodes, repeated the experiment (ie, start the servers and reboot the
switches), and the trouble does not trigger.
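
As a sanity check, the option can be read back on each node (pvenode config get is the counterpart of the set command above and lists whatever node options are set):

  # read back the per-node options, startall-onboot-delay included if set
  pvenode config get

The 120 seconds here presumably just needs to outlast the switch stack's boot and convergence time; the right value depends on the switches.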


Still, I'm asking for some feedback... particularly:

1) we were on PVE 6: has something changed in quorum handling from PVE 6 to
   PVE 8? Because before the upgrade we never hit this...

2) is there a better solution for this?


Thanks.

--
dorsy


