Hello Dejan,

well, actually the nodes communicate over a virtual network device, since they are virtual machines on a XEN host. No physical switches or other network hardware are involved, as far as I know.
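In case it helps with debugging: below is a minimal multicast smoke test one could run between the two VMs. It is only a sketch, not corosync tooling; the group and port are just the corosync example defaults, so they need to be adjusted to the mcastaddr/mcastport from corosync.conf, and it should be run with corosync stopped so the bind() does not conflict.

#!/usr/bin/env python
# Minimal multicast smoke test -- a sketch, not corosync tooling.
# GROUP/PORT are only the corosync example defaults; adjust them to
# the mcastaddr/mcastport from your corosync.conf, and run this with
# corosync stopped so the bind() below does not conflict.
import socket
import struct
import sys

GROUP = '226.94.1.1'
PORT = 5405

def send():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 is enough; both VMs sit on the same Xen bridge.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(('ping from ' + socket.gethostname()).encode(), (GROUP, PORT))

def recv():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', PORT))
    # Join the multicast group on all interfaces.
    mreq = struct.pack('4sl', socket.inet_aton(GROUP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = s.recvfrom(1024)
        print('%s from %s' % (data.decode(), addr[0]))

if __name__ == '__main__':
    if sys.argv[1:] == ['send']:
        send()
    else:
        recv()

Start it without arguments on one node and with "send" on the other; if the ping never arrives, the bridge is dropping multicast. omping from the corosync project does the same check more thoroughly, if it is packaged for your distribution. And since corosync 1.3 there is also the UDP unicast transport (transport: udpu), which avoids multicast on the bridge altogether.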
Regards,
Sascha

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Dejan Muhamedagic
> Sent: Thursday, 22 September 2011 12:25
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] crm crashes after: TOTEM failed to receive
>
> Hi,
>
> On Thu, Sep 22, 2011 at 09:23:11AM +0200, Sascha Hagedorn wrote:
> > Hi Dejan,
> >
> > thank you for your quick response. I will post this on the corosync
> > mailing list as well. But is this "TOTEM Failed to receive" message
> > an indicator that the multicast communication between the two nodes
> > is somehow erroneous?
>
> IIRC, yes. You should probably check how your switch is handling
> multicast.
>
> Thanks,
>
> Dejan
>
> > Regards,
> > Sascha
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]] On Behalf Of Dejan Muhamedagic
> > > Sent: Wednesday, 21 September 2011 14:19
> > > To: General Linux-HA mailing list
> > > Subject: Re: [Linux-HA] crm crashes after: TOTEM failed to receive
> > >
> > > Hi,
> > >
> > > On Wed, Sep 21, 2011 at 02:05:54PM +0200, Sascha Hagedorn wrote:
> > > > Hi everyone,
> > > >
> > > > I am experiencing strange problems on a two-node cluster. The
> > > > nodes are virtual machines on a XEN 5.6 SP2 host. I have set up
> > > > a simple cluster with no resources defined. After I set up the
> > > > configuration and commit it to the cluster, the message
> > > > [TOTEM ] FAILED TO RECEIVE comes up, and shortly after that the
> > > > AIS connection is lost and I cannot access the crm anymore.
> > >
> > > It seems like corosync crashed. You should post to the corosync
> > > ML, http://lists.corosync.org/mailman/listinfo, if possible with
> > > the backtrace.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > Here is the output:
> > > >
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_replace_notify: Replaced: 0.5.23 -> 0.6.1 from <null>
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.6.1) : Non-status change
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff- <cib admin_epoch="0" epoch="5" num_updates="23" />
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > > > Sep 21 13:41:04 node1 attrd: [11501]: info: do_cib_replaced: Sending full refresh
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cib epoch="6" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.5" have-quorum="1" dc-uuid="node1" >
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > > > Sep 21 13:41:04 node1 attrd: [11501]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   <configuration >
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_pe_invoke: Query 70: Requesting the current CIB: S_POLICY_ENGINE
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <crm_config >
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_cib_replaced ]
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       <cluster_property_set id="cib-bootstrap-options" >
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: update_dc: Unset DC node1
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false" __crm_diff_marker__="added:top" />
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore" __crm_diff_marker__="added:top" />
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       </cluster_property_set>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </crm_config>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <rsc_defaults __crm_diff_marker__="added:top" >
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       <meta_attributes id="rsc-options" >
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair id="rsc-options-resource-stickiness" name="resource-stickiness" value="100" />
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       </meta_attributes>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </rsc_defaults>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   </configuration>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cib>
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_replace for section 'all' (origin=local/cibadmin/2, version=0.6.1): ok (rc=0)
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/68, version=0.6.2): ok (rc=0)
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Checking for expired actions every 900000ms
> > > > Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Sending expected-votes=2 to corosync
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: ais_dispatch_message: Membership 208: quorum retained
> > > > Sep 21 13:41:04 node1 crmd: [11503]: info: crmd_ais_dispatch: Setting expected votes to 2
> > > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/73, version=0.6.5): ok (rc=0)
> > > > Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e
> > > > Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80
> > > > Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81
> > > > Sep 21 13:41:04 node1 corosync[11485]: [TOTEM ] Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84
> > > > [... identical Retransmit List lines repeat dozens of times between 13:41:04 and 13:41:09 ...]
> > > > Sep 21 13:41:17 node1 corosync[11485]: [TOTEM ] FAILED TO RECEIVE
> > > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> > > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> > > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> > > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: AIS connection failed
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: AIS connection failed
> > > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: AIS connection failed
> > > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: AIS connection failed
> > > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: crmd_ais_destroy: connection closed
> > > > Sep 21 13:41:26 node1 attrd: [11501]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
> > > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: cib_ais_destroy: AIS connection terminated
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: cib_native_msgready: Lost connection to the CIB service [11499].
> > > > Sep 21 13:41:26 node1 attrd: [11501]: notice: main: Exiting...
> > > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost connection to the CIB service [11499/callback].
> > > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
> > > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost connection to the CIB service [11499/command].
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: stonith_msgready: Lost connection to the STONITH service [11498].
> > > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection to the STONITH service [11498/callback].
> > > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection to the STONITH service [11498/command].
> > > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: tengine_stonith_connection_destroy: Fencing daemon connection failed
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_ELECTION
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State transition S_ELECTION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_dc_release: DC role released
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: pe_connection_destroy: Connection to the Policy Engine released
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_te_control: Transitioner is now inactive
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_shutdown: Disconnecting STONITH...
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_lrm_control: Disconnected from the LRM
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_ha_control: Disconnected from OpenAIS
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_cib_control: Disconnecting CIB
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_exit: Could not recover from internal error
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: crm_xml_cleanup: Cleaning up memory from libxml2
> > > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: [crmd] stopped (2)
> > > >
> > > > This is the configuration, nothing fancy here:
> > > >
> > > > node node1
> > > > node node2
> > > > property $id="cib-bootstrap-options" \
> > > >         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
> > > >         cluster-infrastructure="openais" \
> > > >         expected-quorum-votes="2" \
> > > >         stonith-enabled="false" \
> > > >         no-quorum-policy="ignore"
> > > > rsc_defaults $id="rsc-options" \
> > > >         resource-stickiness="100"
> > > >
> > > > I am running corosync 1.3.1, openais 1.1.4 and pacemaker 1.1.5.
> > > >
> > > > I see this a lot too on a different cluster on this XEN host.
> > > >
> > > > What exactly does "[TOTEM ] FAILED TO RECEIVE" mean, and why does
> > > > the cluster get so unstable after that?
> > > >
> > > > Thank you,
> > > > Sascha

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
