Hi,
On 03/08/2017 11:02 AM, Daniel wrote:
> Hi,
> the cluster was working fine the whole time.
Yes, but if this particular node acted as the IGMP querier, the cluster
would have worked fine while it was present; removing it means there is
no querier anymore, and thus the problems.
It's at least worth checking this; a simple test would be to execute on
at least two nodes:
> omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
This runs for ~10 minutes and should ideally show 0% loss, at least <1%.
See:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
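To see whether the local Linux bridge itself provides an IGMP querier, you can also look at the kernel's bridge settings. A minimal sketch, assuming the cluster network runs over a bridge named vmbr0 (adapt the name to your setup):

```shell
# Check whether the bridge-internal IGMP querier is enabled (0 = disabled)
cat /sys/class/net/vmbr0/bridge/multicast_querier

# As a workaround, enable the bridge-internal querier
# (takes effect immediately, but is not persistent across reboots)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
```

If your switches do IGMP snooping but no device on the segment acts as a querier, enabling this on one node can already stabilize multicast.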
> So I actually found out that the PVE file system is not mounted. And
> here you can also see some of the logs you asked for ;)
Thanks! Have you tried restarting corosync and then pve-cluster?
This is not entirely safe with active HA, but I guess you do not have HA
configured, or else the watchdog would have already triggered.
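Assuming HA is really not in use, the restart could be sketched like this (order matters: the cluster communication layer first, then the cluster filesystem on top of it):

```shell
# Restart the cluster communication layer (corosync) first ...
systemctl restart corosync

# ... then the Proxmox cluster filesystem (pmxcfs)
systemctl restart pve-cluster

# Verify that both services came back up cleanly
systemctl status corosync pve-cluster
```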
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
 Main PID: 2083 (corosync)
   CGroup: /system.slice/corosync.service
           └─2083 corosync

Mar 08 09:41:28 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:32 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:39 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
  Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 22868 (code=killed, signal=KILL)

Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
It seems that "[TOTEM ] Failed to receive the leave message. failed: 13"
was the problem.
This could really indicate multicast problems (see above).
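One way to observe multicast group management directly is to capture IGMP traffic on the cluster interface. A sketch, again assuming the interface is vmbr0 (substitute your actual cluster NIC or bridge):

```shell
# Watch IGMP membership reports and queries on the cluster interface;
# with a working querier you should see periodic membership queries
tcpdump -n -i vmbr0 igmp
```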
Did the problems happen immediately after the removal of the node, or
with a delay of some minutes?
And how did you remove that node?
Just trying to understand your situation here :)
cheers,
Thomas
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user