Hi,
On 03/08/2017 11:02 AM, Daniel wrote:
> Hi,
> the cluster was working fine the whole time.
Yes, but if this particular node acted as the IGMP querier, the cluster
would have worked fine while it was present; removing it means there is
no querier anymore, and thus the problems.
It's at least worth checking this; a simple test would be to execute on
at least two nodes:
> omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
This runs for ~10 minutes and should ideally show 0% loss, at least <1%.
See:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
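To see whether the local Linux bridge itself provides an IGMP querier, you can also look at the kernel's bridge settings. A minimal sketch, assuming the cluster network runs over a bridge named vmbr0 (adapt the name to your setup):

```shell
# Check whether the bridge-internal IGMP querier is enabled (0 = disabled)
cat /sys/class/net/vmbr0/bridge/multicast_querier

# As a workaround, enable the bridge-internal querier
# (takes effect immediately, but is not persistent across reboots)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
```

If your switches do IGMP snooping but no device on the segment acts as a querier, enabling this on one node can already stabilize multicast.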
> So I actually found out that the PVE file system is not mounted. And
> here you can also see some of the logs you asked for ;)
Thanks! Have you tried restarting corosync and then pve-cluster?
This is not entirely safe with active HA, but I guess you do not have HA
configured, or else the watchdog would have already triggered.
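Assuming HA is really not in use, the restart could be sketched like this (order matters: the cluster communication layer first, then the cluster filesystem on top of it):

```shell
# Restart the cluster communication layer (corosync) first ...
systemctl restart corosync

# ... then the Proxmox cluster filesystem (pmxcfs)
systemctl restart pve-cluster

# Verify that both services came back up cleanly
systemctl status corosync pve-cluster
```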
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
 Main PID: 2083 (corosync)
   CGroup: /system.slice/corosync.service
           └─2083 corosync

Mar 08 09:41:28 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:32 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:39 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
  Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 22868 (code=killed, signal=KILL)

Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
It seems that "[TOTEM ] Failed to receive the leave message. failed: 13"
was the problem.
This could really indicate multicast problems (see above).
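One way to observe multicast group management directly is to capture IGMP traffic on the cluster interface. A sketch, again assuming the interface is vmbr0 (substitute your actual cluster NIC or bridge):

```shell
# Watch IGMP membership reports and queries on the cluster interface;
# with a working querier you should see periodic membership queries
tcpdump -n -i vmbr0 igmp
```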
Did the problems happen immediately after the removal of the node, or
with a delay of some minutes?
And how did you remove that node?
Just trying to understand your situation here :)
cheers,
Thomas
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user