On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <and...@beekhof.net> wrote:
> > I have a 3-node cluster where node1 and node2 are running
> > corosync+pacemaker and node3 is running corosync only (for quorum).
> > Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first
> > couple of days.
> >
> > At some point I discovered the following situation: node2 thinks that
> > both node1 and node2 are online, but node1 thinks that node2 is down.
> > Could you please explain how this can happen? There are no connectivity
> > problems between the nodes at the moment (maybe there were earlier, but
> > why hasn't the system recovered?).
>
> The logs show connectivity problems occurring, so no doubt there.
> As to why it hasn't recovered, first check corosync - if it does not have
> a consistent view of the world pacemaker has no hope.
> Alternatively, I recall there was a bug that could be preventing this in
> your version. So if corosync looks fine, perhaps try an upgrade.

Thanks. Are you talking about this bug:
https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?

I believe I reproduced the problem one more time (it is very hard to
reproduce reliably); the symptoms were the following:

1. At some point node2 went down.

2. The last message from corosync on node1 was "Quorum lost" (I suspect
   there was a temporary loss of connectivity with node3).

3. Then, a couple of days later, "service corosync stop" hung on node3
   (only killall -9 helped). I ran strace while the service was stopping;
   it shows:

   [pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
   [pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, si_uid=0} ---
   [pid 19448] write(6, "\3\0\0\0", 4) = 4
   [pid 19448] rt_sigreturn() = 360
   ... <and repeats for 19448 again and again>

   while pstree shows:

   init,1
     ├─corosync,19448
     │   └─{corosync},19449

4. "service corosync stop" hangs on node1 too, with the same symptoms;
   only killall -9 helps there as well.

5. Restarting corosync & pacemaker on node1 and node2 solved the problem.

Could you please say whether this is related to the libqb bug above?

> > The "crm status" output is below. What other logs should I attach for
> > the diagnostics?
> >
> > Also, "service corosync stop" on node1 hangs forever with no additional
> > lines in the logs, so I cannot even stop the service. (But after "service
> > corosync stop" on node1, node2 starts thinking that node1 is offline,
> > although the command still hangs.)
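
Regarding "first check corosync": to compare corosync's view of the
membership on each node, would something like the commands below be the
right check? (This is only a sketch based on the corosync 2.x command-line
tools; the exact cmap key layout may differ between versions.)

   # quorum state and vote information as seen by this node
   corosync-quorumtool -s

   # list the nodes this node currently considers members
   corosync-quorumtool -l

   # totem ring status for this node's interfaces
   corosync-cfgtool -s

   # raw membership entries from the cmap database
   corosync-cmapctl | grep members

If the three nodes disagree on the member list reported here, the
inconsistency is at the corosync level rather than in pacemaker.
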
> > root@node2:~# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> >
> > Online: [ node1 node2 ]
> >
> > Master/Slave Set: ms_drbd [drbd]
> >     Masters: [ node2 ]
> >     Slaves: [ node1 ]
> > Resource Group: server
> >     fs          (ocf::heartbeat:Filesystem):    Started node2
> >     postgresql  (lsb:postgresql):               Started node2
> >     bind9       (lsb:bind9):                    Started node2
> >     nginx       (lsb:nginx):                    Started node2
> >
> >
> > root@node1:/var/log# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> >
> > Online: [ node1 ]
> > OFFLINE: [ node2 ]
> >
> > Master/Slave Set: ms_drbd [drbd]
> >     Masters: [ node1 ]
> >     Stopped: [ node2 ]
> > Resource Group: server
> >     fs          (ocf::heartbeat:Filesystem):    Started node1
> >     postgresql  (lsb:postgresql):               Started node1
> >     bind9       (lsb:bind9):                    Started node1
> >     nginx       (lsb:nginx):                    Started node1
> >
> > Failed actions:
> >     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error
> >
> >
> > A part of "crm configure show":
> >
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.10-42f2063" \
> >     cluster-infrastructure="corosync" \
> >     stonith-enabled="false" \
> >     last-lrm-refresh="1421250983"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
> >
> >
> > Also I see the following in the logs on node1 (maybe related to the issue, maybe not):
> >
> > Jan 22 10:14:02 node1 pengine[2772]: warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down
> > Jan 22 10:14:02 node1 pengine[2772]: warning: determine_online_status: Node node2 is unclean
> > Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: Node node2 is unclean!
> > Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
> > Jan 22 10:14:02 node1 pengine[2772]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
> >
> >
> > On node2 the logs are:
> >
> > Jan 22 10:13:57 node2 corosync[32761]: [TOTEM ] A new membership (188.166.54.190:6276) was formed. Members left: 1760315215 13071578
> > Jan 22 10:13:57 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
> > Jan 22 10:13:57 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] Members[1]: 1017525950
> > Jan 22 10:13:57 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> > Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> > Jan 22 10:13:57 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
> > Jan 22 10:14:01 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
> > Jan 22 10:14:02 node2 crmd[311]: error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead
> > Jan 22 10:14:02 node2 crmd[311]: notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] This node is within the primary component and will provide service.
> > Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
> > Jan 22 10:14:03 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
> > Jan 22 10:14:03 node2 crmd[311]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 crmd[311]: warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> > Jan 22 10:14:04 node2 crmd[311]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > Jan 22 10:14:05 node2 crmd[311]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> > Jan 22 10:15:11 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578
> > Jan 22 10:15:14 node2 corosync[32761]: [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
> > Jan 22 10:15:17 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215
> > Jan 22 10:15:19 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
> > Jan 22 10:15:19 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )
> > Jan 22 10:15:20 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
> > Jan 22 10:15:20 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
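
One more note on the "ENABLE STONITH TO KEEP YOUR RESOURCES SAFE" error
above: we currently run with stonith-enabled="false" (see the configuration
snippet). If fencing is what the cluster needs to recover an unclean node,
a minimal crmsh sketch would look roughly like the following; the
external/ipmi agent and all parameter values here are placeholders, and the
real agent and parameters depend on the hardware available on the nodes:

   # define a fencing device for node2 (agent and parameter values are placeholders)
   crm configure primitive fence-node2 stonith:external/ipmi \
       params hostname="node2" ipaddr="192.0.2.10" userid="admin" \
              passwd="secret" interface="lan"

   # re-enable fencing cluster-wide
   crm configure property stonith-enabled=true

This is only an illustration, of course; the actual stonith agent would
have to match the IPMI/PDU hardware we really have.
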
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org