This sounds like a membership problem that people have reported previously.
I strongly suspect an upgrade to pacemaker 1.1.8 and the latest
corosync 1.4.x would fix it.


On Tue, Oct 30, 2012 at 4:18 PM, Jeff Johnson <[email protected]> wrote:
> Hello,
>
> I have four identical and separate pairs of corosync/pacemaker nodes.
> Two node pairs using multicast over ethernet on the same network, same
> rack, same switch.
>
> One node, according to crm_mon, shows every resource and ring as failed.
> The other node in the same pair sees everything good and the resources
> active in the correct locations, even the resources on the node that
> thinks everything is failed.
>
> If I kill corosync on the node that sees everything bad the stonith
> fence activates and kills the node seeing everything failed. Upon
> reboot the node still sees everything as failed and yet when it comes
> online the good node is able to transfer resources (filesystem mounts)
> to the node that sees everything as bad.
>
> I've tried corosync-cfgtool -r and it does nothing. crm resource
> cleanup <rsc> does nothing. I tried enabling an authkey on the nodes
> and the ring is still failed on one machine. All firewalls are
> disabled and I am able to ssh and pass other network traffic easily.
>
> Here is the output of crm_mon -1 -V from both nodes:
>
> Good node (node2):
> ============
> Last updated: Mon Oct 29 23:00:21 2012
> Last change: Sat Oct 27 16:58:22 2012 via cibadmin on node1
> Stack: openais
> Current DC: node2 - partition with quorum
> Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Online: [ node1 node2 ]
>
>  resFS0000      (ocf::heartbeat:Filesystem):    Started node1
>  resFS0001      (ocf::heartbeat:Filesystem):    Started node1
>  resFS0002      (ocf::heartbeat:Filesystem):    Started node2
>  resFS0003      (ocf::heartbeat:Filesystem):    Started node2
>  ston-ipmi-node1        (stonith:fence_ipmilan):        Started node2
>  ston-ipmi-node2        (stonith:fence_ipmilan):        Started node1
>
>
> All rsc/ring failed node1
> ============
> Last updated: Mon Oct 29 23:00:46 2012
> Last change: Mon Oct 29 17:57:29 2012 via cibadmin on node1
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Node node1: UNCLEAN (offline)
> Node node2: UNCLEAN (offline)
>
> /var/log/cluster/corosync.log
> Oct 29 23:00:01 node1 lrmd: [2803]: info: rsc:resFS0000:12: monitor
> Oct 29 23:00:01 [2806] node1       crmd:     info: process_lrm_event:
>         LRM operation resFS0000_monitor_120000 (call=12, rc=0, cib-update=17,
> confirmed=false) ok
> Oct 29 23:00:01 [2801] node1        cib:  warning: cib_peer_callback:
>         Discarding cib_apply_diff message (963) from node2: not in our
> membership
> Oct 29 23:00:28 [2806] node1       crmd:  warning: cib_rsc_callback:
>         Resource update 7 failed: (rc=-41) Remote node did not respond
> Oct 29 23:01:28 [2806] node1       crmd:   notice:
> erase_xpath_callback:   Deletion of
> "//node_state[@uname='oss1']/transient_attributes": Remote node did
> not respond (rc=-41)
> Oct 29 23:01:30 [2804] node1      attrd:  warning: attrd_cib_callback:
>         Update 4 for probe_complete=true failed: Remote node did not respond
>
> It was working fine and then this. The other node pairs, identically
> set up (different multicast addresses), are working fine.
>
> Ideas? I'm having a hard time coming up with a cause. Especially since
> the good node (node2) can see node1's status, move resources to node1
> successfully, take control of them in the event of a fence operation on node1,
> etc. It is as if corosync is really healthy but some status file on
> node1 is stuck not refreshing or something.
>
> Thanks!
>
> --Jeff
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems