Sorry, my previous mail was sent too quickly and was missing some of the requested logs.

Regarding syslog, here are some corosync extracts (the following log block repeats in a loop, with different numbers):

May  6 18:38:02 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 4 link: 0 is down
May  6 18:38:02 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:02 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 has no active links
May  6 18:38:05 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 4 link: 0 is up
May  6 18:38:05 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:10 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 3 link: 0 is down
May  6 18:38:10 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:10 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 has no active links
May  6 18:38:12 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 3 link: 0 is up
May  6 18:38:12 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:12 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 64
May  6 18:38:18 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 4 link: 0 is down
May  6 18:38:18 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:18 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 has no active links
May  6 18:38:19 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 3 link: 0 is down
May  6 18:38:19 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:19 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 has no active links
May  6 18:38:20 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 4 link: 0 is up
May  6 18:38:20 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:21 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 3 link: 0 is up
May  6 18:38:21 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 3 link: 0 is down
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 4 link: 0 is down
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 has no active links
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:29 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 has no active links
May  6 18:38:31 inf-proxmox6 corosync[2674]:   [TOTEM ] Token has not been received in 107 ms
May  6 18:38:31 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 3 link: 0 is up
May  6 18:38:31 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 4 link: 0 is up
May  6 18:38:31 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:31 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: fd
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 100
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 101
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 102
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 103
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 104
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 106
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 107
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 108
May  6 18:38:42 inf-proxmox6 corosync[2674]:   [TOTEM ] Retransmit List: 109
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 3 link: 0 is down
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] link: host: 4 link: 0 is down
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 has no active links
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:44 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 has no active links
May  6 18:38:46 inf-proxmox6 corosync[2674]:   [TOTEM ] Token has not been received in 106 ms
May  6 18:38:46 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 4 link: 0 is up
May  6 18:38:46 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
May  6 18:38:47 inf-proxmox6 corosync[2674]:   [KNET  ] rx: host: 3 link: 0 is up
May  6 18:38:47 inf-proxmox6 corosync[2674]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May  6 18:38:51 inf-proxmox6 corosync[2674]:   [TOTEM ] Token has not been received in 4511 ms
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [TOTEM ] A new membership (1.ea8) was formed. Members
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [QUORUM] Members[4]: 1 3 4 5
May  6 18:38:52 inf-proxmox6 corosync[2674]:   [MAIN  ] Completed service synchronization, ready to provide service.
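
In case it helps, the knet link state and the quorum can be cross-checked directly on a node while the flapping happens. This is only a sketch of the commands (no output from this cluster captured here):

# state of each knet link as corosync sees it
corosync-cfgtool -s

# quorum, expected votes and current membership
pvecm status

# follow corosync and pmxcfs messages live
journalctl -f -u corosync -u pve-cluster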

Nothing really relevant regarding pve-cluster in the logs, since it is marked as succeeded. For instance:

May  6 22:17:33 inf-proxmox6 systemd[1]: Stopping The Proxmox VE cluster filesystem...
May  6 22:17:33 inf-proxmox6 pmxcfs[2561]: [main] notice: teardown filesystem
May  6 22:17:33 inf-proxmox6 pvestatd[2906]: status update time (19.854 seconds)
May  6 22:17:34 inf-proxmox6 systemd[7888]: etc-pve.mount: Succeeded.
May  6 22:17:34 inf-proxmox6 systemd[1]: etc-pve.mount: Succeeded.
May  6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed - Operation not supported
May  6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed - Operation not supported
May  6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed - Operation not supported
May  6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed - Operation not supported
May  6 22:17:35 inf-proxmox6 pmxcfs[2561]: [main] notice: exit proxmox configuration filesystem (0)
May  6 22:17:35 inf-proxmox6 systemd[1]: pve-cluster.service: Succeeded.
May  6 22:17:35 inf-proxmox6 systemd[1]: Stopped The Proxmox VE cluster filesystem.
May  6 22:17:35 inf-proxmox6 systemd[1]: Starting The Proxmox VE cluster filesystem...
May  6 22:17:35 inf-proxmox6 pmxcfs[8260]: [status] notice: update cluster info (cluster name  cluster-proxmox, version = 6)
May  6 22:17:35 inf-proxmox6 corosync[8007]:   [TOTEM ] A new membership (1.1998) was formed. Members joined: 2 3 4 5
May  6 22:17:36 inf-proxmox6 systemd[1]: Started The Proxmox VE cluster filesystem.

Here is another extract that also shows some slow ops on a Ceph OSD:

May  6 18:38:59 inf-proxmox6 corosync[2674]:   [TOTEM ] Token has not been received in 3810 ms
May  6 18:39:00 inf-proxmox6 systemd[1]: Starting Proxmox VE replication runner...
May  6 18:39:01 inf-proxmox6 ceph-mon[1119484]: 2020-05-06 18:39:01.493 7feaed4bb700 -1 mon.0@0(leader) e6 get_health_metrics reporting 46 slow ops, oldest is osd_failure(failed timeout osd.5 [v2:192.168.217.8:6884/1879695,v1:192.168.217.8:6885/1879695] for 20sec e73191 v73191)
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [TOTEM ] A new membership (1.eb4) was formed. Members
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [CPG   ] downlist left_list: 0 received
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [QUORUM] Members[4]: 1 3 4 5
May  6 18:39:02 inf-proxmox6 corosync[2674]:   [MAIN  ] Completed service synchronization, ready to provide service.
May  6 18:39:02 inf-proxmox6 pvesr[1409653]: trying to acquire cfs lock 'file-replication_cfg' ...
May  6 18:39:03 inf-proxmox6 pvesr[1409653]: trying to acquire cfs lock 'file-replication_cfg' ...
May  6 18:39:06 inf-proxmox6 systemd[1]: pvesr.service: Succeeded.
May  6 18:39:06 inf-proxmox6 systemd[1]: Started Proxmox VE replication runner.
May  6 18:39:06 inf-proxmox6 ceph-mon[1119484]: 2020-05-06 18:39:06.493 7feaed4bb700 -1 mon.0@0(leader) e6 get_health_metrics reporting 46 slow ops, oldest is osd_failure(failed timeout osd.5 [v2:192.168.217.8:6884/1879695,v1:192.168.217.8:6885/1879695] for 20sec e73191 v73191)
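
Regarding the slow ops reported against osd.5, I can dig further on the Ceph side if that is useful. Again, only a sketch of the commands I would use, not output from this cluster:

# overall health plus the detail behind the slow ops warning
ceph health detail

# in-flight and recent slow operations, run on the node hosting osd.5
ceph daemon osd.5 dump_ops_in_flight
ceph daemon osd.5 dump_historic_slow_ops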

In case any of that makes sense to someone, thank you again,
Hervé

On 11/05/2020 17:58, Herve Ballans wrote:

Hi Eneko,

Thanks for your answer. At first I was also thinking of a network issue, but the physical network equipment doesn't seem to show any specific problems... Here are more details on the cluster:

2x10Gb + 2x1Gb interfaces:

  * a 10Gb interface for the Ceph cluster
  * a 10Gb interface for the main cluster network
  * the other two 1Gb interfaces are used for two other VLANs for the VMs (a corosync fallback-link sketch follows below)
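
Since corosync currently only has a single link (link: 0 in the logs above), one thing we are considering is adding a second corosync link on one of the 1Gb interfaces as a fallback. A sketch of what /etc/pve/corosync.conf could look like for one node; the addresses below are made up and config_version would of course have to be bumped:

nodelist {
  node {
    name: inf-proxmox6
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.216.6   # existing 10Gb cluster network (made-up address)
    ring1_addr: 192.168.100.6   # fallback on a 1Gb interface (made-up address)
  }
  # ... same ring1_addr addition for the four other nodes ...
}

totem {
  # existing options kept as-is, plus per-link priorities so the 10Gb link stays preferred
  interface {
    linknumber: 0
    knet_link_priority: 2
  }
  interface {
    linknumber: 1
    knet_link_priority: 1
  }
}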



On 11/05/2020 10:39, Eneko Lacunza wrote:
Hi Hervé,

This seems like a network issue. What is the network setup in this cluster? What does syslog show for corosync and pve-cluster?

Don't enable HA until you have a stable cluster quorum.

Cheers
Eneko

El 11/5/20 a las 10:35, Herve Ballans escribió:
Hi everybody,

I would like to take the opportunity at the beginning of this new week to raise my issue again.

Does anyone have any idea why such a problem occurred, or is this problem really something new?

Thanks again,
Hervé

On 07/05/2020 18:28, Herve Ballans wrote:
Hi all,

*Cluster info:*

 * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
 * Ceph RBD storage (Nautilus)
 * In production for many years with no major issues
 * No specific network problems at the time the problem occurred
 * Nodes are on the same date/time (configured with the same NTP server)

*Symptoms:*

Suddenly, last night (around 7 PM), all nodes of our cluster seem to have rebooted at the same time for no apparent reason (I mean, we weren't doing anything on it)! During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE replication runner" failed. After the nodes rebooted, we had to start those services manually.

Once rebooted with all PVE services running, some nodes were in HA LRM status "old timestamp - dead?" while others were in "active" or "wait_for_agent_lock" status... Nodes switched states regularly, looping back and forth as long as we didn't change the configuration...

At the same time, the pve-ha-crm service got unexpected errors, for example: "Configuration file 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even though the file exists, but on another node! Such a message is probably a consequence of the fencing between nodes due to the changes of status...

*What we have tried until now to stabilize the situation:*

After several investigations and several operations that failed to solve anything (in particular a complete upgrade to the latest PVE version, 6.1-11), we finally removed the HA configuration from all the VMs.
Since then, the state seems to have stabilized, although it is obviously not nominal!

Now, all the nodes are in HA LRM status "idle" and sometimes switch to the "old timestamp - dead?" state, then come back to "idle".
None of them is in the "active" state.
Obviously, the quorum status is "no quorum".
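
For completeness, this is basically how we watch these state changes (commands only; the output keeps switching as described above):

# HA view: quorum, current master, LRM state of every node and the managed services
ha-manager status

# the HA daemons themselves on a given node
systemctl status pve-ha-lrm pve-ha-crm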

Note that as soon as we try to re-activate HA on the VMs, the problem occurs again (nodes reboot!) :(

*Question:*

Have you ever experienced such a problem, or do you know a way to restore a correct HA configuration in this case?
Note that the nodes are currently on PVE version 6.1-11.

I can post some specific logs if useful.

Thanks in advance for your help,
Hervé

_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user