Hello,

I am running a 3-node Pacemaker (1.1.8) + Corosync (2.1.0) cluster on Ubuntu 
12.04. Two of the nodes are "real" nodes, hosting a DRBD filesystem mount and 
some daemons:
http://pastebin.com/n1sNMhuE
The third node cannot run resources and acts as a quorum node in standby.

Recently, all of the nodes have started changing to the "pending" state, and 
they may remain there for a long time (many days) before coming back online, 
if they ever do. Running "crm node clearstate" does not help.

Tonight I stopped pacemaker and corosync on all nodes and emptied the contents 
of /var/lib/pacemaker/cib, /var/lib/pacemaker/pengine, and /var/lib/corosync. 
I then restarted corosync and pacemaker on all of the nodes and repopulated 
the CIB once the nodes had all joined. This restored the nodes' states to 
"online". However, after a few minutes the nodes all went back into "pending", 
this time for only around 5 minutes. Here's the log from the current DC:
http://pastebin.com/xhfsb15d

There do not appear to be any faults in the corosync rings:
RING ID 0
        id      = 192.168.1.170
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.7.170
        status  = ring 1 active with no faults

corosync.conf:
http://pastebin.com/DQUNdp9f

Some common messages I am seeing in the log:
  Peer is not part of our cluster
  Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: current "epoch" 
  is less than required
(I see the second message with "epoch", "admin_epoch", and "num_updates" as 
the named field.)
What do these messages mean? Do they indicate a problem?
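
To make the second message concrete, here is a minimal Python sketch of how I 
currently read it. It assumes the three dotted numbers are 
admin_epoch.epoch.num_updates, in that order, and that a diff is rejected when 
the local CIB version is behind the version the diff starts from. Both points 
are my assumptions, not something I have confirmed in the Pacemaker source. 
The names CibVersion, DIFF_RE, first_lagging_field, and explain are made up 
for illustration.

```python
import re
from typing import NamedTuple, Optional


class CibVersion(NamedTuple):
    # Assumed field order: admin_epoch.epoch.num_updates
    admin_epoch: int
    epoch: int
    num_updates: int

    @classmethod
    def parse(cls, text: str) -> "CibVersion":
        a, e, n = (int(part) for part in text.split("."))
        return cls(a, e, n)


# Hypothetical pattern for the "Diff ... not applied" log line quoted above.
DIFF_RE = re.compile(
    r"Diff (\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+) from (\S+) "
    r"not applied to (\d+\.\d+\.\d+)"
)


def first_lagging_field(current: CibVersion,
                        required: CibVersion) -> Optional[str]:
    """Return the most significant field where current < required, else None."""
    for name in CibVersion._fields:
        cur, req = getattr(current, name), getattr(required, name)
        if cur < req:
            return name
        if cur > req:
            return None  # current is already ahead on a more significant field
    return None  # versions are equal


def explain(line: str) -> str:
    """Summarise which version field is behind in one log line."""
    m = DIFF_RE.search(line)
    if m is None:
        raise ValueError("not a 'Diff ... not applied' message")
    source, _target, sender, current = m.groups()
    field = first_lagging_field(CibVersion.parse(current),
                                CibVersion.parse(source))
    if field is None:
        return f"local CIB {current} is not behind {source} from {sender}"
    return f"local CIB {current} is behind {source} from {sender} on {field}"


if __name__ == "__main__":
    print(explain('Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: '
                  'current "epoch" is less than required'))
```

For the line quoted above, this reports "epoch" as the field that is behind 
(105 locally versus 106 in the diff), which matches the wording in the log.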

Do you have any ideas on what may be causing this behavior?

Thanks,

Andrew
