This looks like the underlying problem:

Feb 10 23:58:07 [1199] vcsquorum        cib:   notice: cib:diff:        --     <node uname="vcsquorum.example.com" id="755053578" />
Feb 10 23:58:07 [1199] vcsquorum        cib:   notice: cib:diff:        ++     <node id="755053578" uname="vcsquorum" />

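Something is confused about what the node(s) should be called: the entry
for node id 755053578 is removed with uname "vcsquorum.example.com" and
re-added with uname "vcsquorum", so both the fully qualified name and the
short hostname are in use for the same node.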
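
To check whether other nodes are affected the same way, a rough sketch
like the following could scan a log for cib:diff entries where a node id
is removed under one uname and added under another. The function name
find_renamed_nodes and the regular expressions are illustrative
assumptions based only on the two log lines quoted above; they assume each
diff entry sits on a single line, as it would in the log file itself, and
they are not part of Pacemaker.

```python
import re

# Assumed format, inferred from the two quoted log lines:
#   ... cib:diff:  --  <node uname="..." id="..." />
#   ... cib:diff:  ++  <node id="..." uname="..." />
DIFF_RE = re.compile(r'cib:diff:\s+(--|\+\+)\s*<node\s+([^>]*?)\s*/?>')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def find_renamed_nodes(log_lines):
    """Return {node_id: (removed_uname, added_uname)} for every node id
    that is removed under one uname and added under a different one."""
    removed, added = {}, {}
    for line in log_lines:
        match = DIFF_RE.search(line)
        if not match:
            continue
        sign, attr_text = match.groups()
        attrs = dict(ATTR_RE.findall(attr_text))
        if "id" not in attrs or "uname" not in attrs:
            continue
        # "--" marks a removed entry, "++" an added one.
        target = removed if sign == "--" else added
        target[attrs["id"]] = attrs["uname"]
    return {
        node_id: (removed[node_id], added[node_id])
        for node_id in removed
        if node_id in added and removed[node_id] != added[node_id]
    }
```

Feeding it the lines of the DC's log (for example, the lines read from
an open file object) would list each node id together with the two names
it has been seen under.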

On Mon, Feb 11, 2013 at 6:48 PM, Andrew Martin <amar...@xes-inc.com> wrote:
> Hello,
>
> I am running a 3-node Pacemaker (1.1.8) + Corosync (2.1.0) cluster on Ubuntu 
> 12.04. Two of the nodes are "real" nodes, hosting a DRBD filesystem mount and 
> some daemons:
> http://pastebin.com/n1sNMhuE
> The third node cannot run resources and acts as a quorum node in standby.
>
> Recently, the nodes will all change to the "pending" state, and may remain 
> there for quite some time (many days) before coming back online (if ever). 
> Using "crm node clearstate" does not help.
>
> Tonight I stopped pacemaker and corosync on all nodes, emptied the contents 
> of /var/lib/pacemaker/cib, /var/lib/pacemaker/pengine, and /var/lib/corosync. 
> After doing so, I restarted corosync and pacemaker on all of the nodes, and 
> repopulated the CIB once the nodes all joined. This worked in restoring the 
> nodes' states to "online"; however, after a few minutes, the nodes all went 
> back into "pending", this time only for around 5 minutes. Here's the log from 
> the current DC:
> http://pastebin.com/xhfsb15d
>
> There do not appear to be any faults in the corosync rings:
> RING ID 0
>         id      = 192.168.1.170
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.7.170
>         status  = ring 1 active with no faults
>
> corosync.conf:
> http://pastebin.com/DQUNdp9f
>
> Some common messages I am seeing in the log:
> Peer is not part of our cluster
> Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: current "epoch" is 
> less than required (epoch, admin_epoch, and num_updates all appear in this 
> message)
> What do these messages mean? Do they indicate a problem?
>
> Do you have any ideas on what may be causing this behavior?
>
> Thanks,
>
> Andrew
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

