Hi,

On Wed, Sep 21, 2011 at 02:05:54PM +0200, Sascha Hagedorn wrote:
> Hi everyone,
>
> I am experiencing strange problems on a two-node cluster. The nodes are
> virtual machines on a XEN 5.6 SP2 host. I have set up a simple cluster with
> no resources defined. After I set up the configuration and commit it to the
> cluster, the message [TOTEM ] FAILED TO RECEIVE comes up, and shortly after
> that the AIS connection is lost and I can no longer access the crm.
It seems like corosync crashed. You should post to the corosync ML,
http://lists.corosync.org/mailman/listinfo, if possible with the backtrace.

Thanks,

Dejan

> Here is the output:
>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_replace_notify: Replaced: 0.5.23 -> 0.6.1 from <null>
> Sep 21 13:41:04 node1 crmd: [11503]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.6.1) : Non-status change
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff- <cib admin_epoch="0" epoch="5" num_updates="23" />
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Sep 21 13:41:04 node1 attrd: [11501]: info: do_cib_replaced: Sending full refresh
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cib epoch="6" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.5" have-quorum="1" dc-uuid="node1" >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Sep 21 13:41:04 node1 attrd: [11501]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <configuration >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_pe_invoke: Query 70: Requesting the current CIB: S_POLICY_ENGINE
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <crm_config >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_cib_replaced ]
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cluster_property_set id="cib-bootstrap-options" >
> Sep 21 13:41:04 node1 crmd: [11503]: info: update_dc: Unset DC node1
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false" __crm_diff_marker__="added:top" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore" __crm_diff_marker__="added:top" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cluster_property_set>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </crm_config>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <rsc_defaults __crm_diff_marker__="added:top" >
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <meta_attributes id="rsc-options" >
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <nvpair id="rsc-options-resource-stickiness" name="resource-stickiness" value="100" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </meta_attributes>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </rsc_defaults>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </configuration>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cib>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_replace for section 'all' (origin=local/cibadmin/2, version=0.6.1): ok (rc=0)
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/68, version=0.6.2): ok (rc=0)
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Checking for expired actions every 900000ms
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Sending expected-votes=2 to corosync
> Sep 21 13:41:04 node1 crmd: [11503]: info: ais_dispatch_message: Membership 208: quorum retained
> Sep 21 13:41:04 node1 crmd: [11503]: info: crmd_ais_dispatch: Setting expected votes to 2
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/73, version=0.6.5): ok (rc=0)
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:05 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:05 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:05 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:06 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:06 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:07 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:07 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:07 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:08 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:08 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:09 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:09 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> Sep 21 13:41:17 node1 corosync[11485]: [TOTEM ] FAILED TO RECEIVE
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: AIS connection failed
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: AIS connection failed
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: AIS connection failed
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: AIS connection failed
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
> Sep 21 13:41:26 node1 crmd: [11503]: info: crmd_ais_destroy: connection closed
> Sep 21 13:41:26 node1 attrd: [11501]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: cib_ais_destroy: AIS connection terminated
> Sep 21 13:41:26 node1 crmd: [11503]: info: cib_native_msgready: Lost connection to the CIB service [11499].
> Sep 21 13:41:26 node1 attrd: [11501]: notice: main: Exiting...
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost connection to the CIB service [11499/callback].
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost connection to the CIB service [11499/command].
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
> Sep 21 13:41:26 node1 crmd: [11503]: info: stonith_msgready: Lost connection to the STONITH service [11498].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection to the STONITH service [11498/callback].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection to the STONITH service [11498/command].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: tengine_stonith_connection_destroy: Fencing daemon connection failed
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_ELECTION
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State transition S_ELECTION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_dc_release: DC role released
> Sep 21 13:41:26 node1 crmd: [11503]: info: pe_connection_destroy: Connection to the Policy Engine released
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_te_control: Transitioner is now inactive
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_shutdown: Disconnecting STONITH...
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_lrm_control: Disconnected from the LRM
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_ha_control: Disconnected from OpenAIS
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_cib_control: Disconnecting CIB
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_exit: Could not recover from internal error
> Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: [crmd] stopped (2)
>
> This is the configuration, nothing fancy here:
>
> node node1
> node node2
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> I am running corosync 1.3.1, openais 1.1.4 and pacemaker 1.1.5.
>
> I see this a lot too on a different cluster on this XEN host.
>
> What does "[TOTEM ] FAILED TO RECEIVE" exactly mean, and why does the cluster
> get so unstable after that?
>
> Thank you,
> Sascha
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
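[Editor's note: on the FAILED TO RECEIVE question itself — totem raises it after missing expected messages for fail_recv_const consecutive token rotations, and the long Retransmit List runs above suggest dropped traffic on the virtualized network rather than normal operation. A hedged corosync.conf sketch of the totem tuning commonly tried in that situation; the values are illustrative, not a recommendation from this thread.]

```
totem {
    version: 2
    # Token rotations without receiving an expected message before
    # "FAILED TO RECEIVE" fires and a new configuration is formed
    # (corosync default: 2500).
    fail_recv_const: 5000
    # corosync >= 1.3 can use UDP unicast instead of multicast, which
    # sidesteps unreliable multicast on virtualized bridges; udpu
    # requires listing cluster members explicitly in the interface
    # section.
    transport: udpu
}
```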
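[Editor's note: for readers landing on this thread, a minimal sketch of how the backtrace Dejan asks for can be captured. The binary path, core-file location, and PID in the core name are illustrative assumptions, not details from this thread; adjust them for your distribution.]

```shell
#!/bin/sh
# Sketch only: allow corosync to write a core file before reproducing the
# crash, then extract a full backtrace from the core with gdb.
ulimit -c unlimited    # must be in effect in the shell that starts corosync

# ... reproduce the crash; then, assuming a core file was written
# (path and name below are examples):
gdb --batch -ex 'thread apply all bt full' \
    /usr/sbin/corosync /var/lib/corosync/core.11485 > corosync-bt.txt
```

The resulting corosync-bt.txt is what you would attach when posting to the corosync ML.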
