This sounds like a membership problem that people have reported previously. I strongly suspect an upgrade to pacemaker 1.1.8 and the latest corosync 1.4.x would fix it.
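Before (or instead of) upgrading, it may be worth confirming the membership mismatch directly. A hedged diagnostic sketch, not taken from the thread: these commands compare what corosync and what pacemaker each believe about membership on the affected node. The commands exist in corosync 1.4.x / pacemaker 1.1.x, though output formats vary by version; the `command -v` guard makes the script a no-op on hosts without the cluster stack installed.

```shell
# Compare corosync's view of the ring with pacemaker's view of membership.
# Guarded so this does nothing on machines without the cluster stack.
if command -v corosync-cfgtool >/dev/null 2>&1; then
    corosync-cfgtool -s   # ring status from corosync's point of view
    crm_node -l           # membership list from pacemaker's point of view
    crm_mon -1            # one-shot cluster status as this node's CIB sees it
fi
```

If `corosync-cfgtool -s` reports the ring healthy while `crm_mon` on the same node shows everything failed, that points at the pacemaker membership layer rather than the network, which is consistent with the symptoms described below.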
On Tue, Oct 30, 2012 at 4:18 PM, Jeff Johnson <[email protected]> wrote:
> Hello,
>
> I have four identical and separate pairs of corosync/pacemaker nodes.
> Two node pairs use multicast over ethernet on the same network, same
> rack, same switch.
>
> One node, according to crm_mon, reports every resource and ring as
> failed. The other node in the same pair sees everything good and the
> resources active in the correct locations, even the resources on the
> node that thinks everything is failed.
>
> If I kill corosync on the node that sees everything bad, the stonith
> fence activates and kills the node seeing everything failed. Upon
> reboot the node still sees everything as failed, and yet when it comes
> online the good node is able to transfer resources (filesystem mounts)
> to the node that sees everything as bad.
>
> I've tried corosync-cfgtool -r and it does nothing. crm resource
> cleanup <rsc> does nothing. I tried enabling an authkey on the nodes
> and the ring is still failed on one machine. All firewalls are
> disabled and I am able to ssh and pass other network traffic easily.
>
> Here is the output of crm_mon -1 -V from both nodes:
>
> Good node (node2):
> ============
> Last updated: Mon Oct 29 23:00:21 2012
> Last change: Sat Oct 27 16:58:22 2012 via cibadmin on node1
> Stack: openais
> Current DC: node2 - partition with quorum
> Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Online: [ node1 node2 ]
>
> resFS0000 (ocf::heartbeat:Filesystem): Started node1
> resFS0001 (ocf::heartbeat:Filesystem): Started node1
> resFS0002 (ocf::heartbeat:Filesystem): Started node2
> resFS0003 (ocf::heartbeat:Filesystem): Started node2
> ston-ipmi-node1 (stonith:fence_ipmilan): Started node2
> ston-ipmi-node2 (stonith:fence_ipmilan): Started node1
>
> All rsc/ring failed node (node1):
> ============
> Last updated: Mon Oct 29 23:00:46 2012
> Last change: Mon Oct 29 17:57:29 2012 via cibadmin on node1
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Node node1: UNCLEAN (offline)
> Node node2: UNCLEAN (offline)
>
> /var/log/cluster/corosync.log:
> Oct 29 23:00:01 node1 lrmd: [2803]: info: rsc:resFS0000:12: monitor
> Oct 29 23:00:01 [2806] node1 crmd: info: process_lrm_event:
>   LRM operation resFS0000_monitor_120000 (call=12, rc=0, cib-update=17,
>   confirmed=false) ok
> Oct 29 23:00:01 [2801] node1 cib: warning: cib_peer_callback:
>   Discarding cib_apply_diff message (963) from node2: not in our
>   membership
> Oct 29 23:00:28 [2806] node1 crmd: warning: cib_rsc_callback:
>   Resource update 7 failed: (rc=-41) Remote node did not respond
> Oct 29 23:01:28 [2806] node1 crmd: notice: erase_xpath_callback:
>   Deletion of "//node_state[@uname='oss1']/transient_attributes":
>   Remote node did not respond (rc=-41)
> Oct 29 23:01:30 [2804] node1 attrd: warning: attrd_cib_callback:
>   Update 4 for probe_complete=true failed: Remote node did not respond
>
> It was working fine and then this happened. The other node pairs,
> identically set up (different multicast addresses), are working fine.
>
> Ideas? I'm having a hard time coming up with a cause, especially since
> the good node (node2) can see node1's status, move resources to node1
> successfully, take control of them in the event of a fence operation
> on node1, etc.
> It is as if corosync is healthy but some status file on node1 is
> stuck and not refreshing.
>
> Thanks!
>
> --Jeff
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
