Hi Dejan,

Thank you for your quick response. I will post this on the corosync mailing
list as well. But is this "TOTEM Failed to receive" message an indicator that
the multicast communication between the two nodes is somehow faulty?
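For what it's worth, one way to sanity-check multicast delivery independently of corosync is a tiny send/receive loopback test (the corosync project's omping utility is the usual tool for node-to-node checks). The sketch below is an assumption-laden illustration: the group and port are corosync-style placeholder defaults, so substitute the mcastaddr/mcastport from your corosync.conf.

```python
import socket
import struct

# Assumptions: illustrative corosync-style defaults; replace with the
# mcastaddr/mcastport from your corosync.conf.
MCAST_GROUP = "239.192.43.1"
MCAST_PORT = 5405

def multicast_self_test(group=MCAST_GROUP, port=MCAST_PORT, timeout=2.0):
    """Send one multicast datagram and report whether it came back locally."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    rx.bind(("", port))
    # Join the multicast group on the default interface (INADDR_ANY).
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    rx.settimeout(timeout)

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    try:
        tx.sendto(b"totem-check", (group, port))
        data, _ = rx.recvfrom(1024)
        return data == b"totem-check"
    except (socket.timeout, OSError):
        return False
    finally:
        rx.close()
        tx.close()

if __name__ == "__main__":
    print("multicast loopback OK" if multicast_self_test()
          else "multicast loopback FAILED")
```

Run as-is it only exercises the local stack; for a real two-node check, join the group on one node and send from the other.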

Regards,
Sascha

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Dejan Muhamedagic
> Sent: Wednesday, 21 September 2011 14:19
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] crm crashes after: TOTEM failed to receive
>
> Hi,
>
> On Wed, Sep 21, 2011 at 02:05:54PM +0200, Sascha Hagedorn wrote:
> > Hi everyone,
> >
> > I am experiencing strange problems on a two-node cluster. The nodes
> are virtual machines on a XEN 5.6 SP2 host. I have set up a simple
> cluster with no resources defined. After I set up the configuration and
> commit it to the cluster, the message [TOTEM ] FAILED TO RECEIVE comes
> up, and shortly after that the AIS connection is lost and I can no
> longer access the crm shell.
>
> It seems like corosync crashed. You should post to the corosync
> ML, http://lists.corosync.org/mailman/listinfo, if possible with
> the backtrace.
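In case it helps with capturing that backtrace: one approach is to let corosync dump core on the next crash and inspect the core with gdb. This is only a sketch; the binary and core paths below are typical guesses that vary by distribution, and kernel.core_pattern may send core files elsewhere.

```shell
# Allow core dumps in the shell that (re)starts corosync.
ulimit -c unlimited
ulimit -c            # verify: prints "unlimited"

# After the next crash, load the core into gdb and dump all threads.
# (Paths are assumptions -- adjust to your install:)
#   gdb /usr/sbin/corosync /var/lib/corosync/core.<pid>
#   (gdb) thread apply all bt
```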
>
> Thanks,
>
> Dejan
>
> > Here is the output:
> >
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib_replace_notify:
> Replaced: 0.5.23 -> 0.6.1 from <null>
> > Sep 21 13:41:04 node1 crmd: [11503]: info: abort_transition_graph:
> te_update_diff:124 - Triggered transition abort (complete=1, tag=diff,
> id=(null), magic=NA, cib=0.6.1) : Non-status change
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff- <cib
> admin_epoch="0" epoch="5" num_updates="23" />
> > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Sep 21 13:41:04 node1 attrd: [11501]: info: do_cib_replaced: Sending
> full refresh
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cib epoch="6"
> num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
> crm_feature_set="3.0.5" have-quorum="1" dc-uuid="node1" >
> > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> > Sep 21 13:41:04 node1 attrd: [11501]: info: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   <configuration
> >
> > Sep 21 13:41:04 node1 crmd: [11503]: info: do_pe_invoke: Query 70:
> Requesting the current CIB: S_POLICY_ENGINE
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <crm_config >
> > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_ELECTION [ input=I_ELECTION
> cause=C_FSA_INTERNAL origin=do_cib_replaced ]
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> <cluster_property_set id="cib-bootstrap-options" >
> > Sep 21 13:41:04 node1 crmd: [11503]: info: update_dc: Unset DC node1
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled"
> value="false" __crm_diff_marker__="added:top" />
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy"
> value="ignore" __crm_diff_marker__="added:top" />
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> </cluster_property_set>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </crm_config>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <rsc_defaults
> __crm_diff_marker__="added:top" >
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> <meta_attributes id="rsc-options" >
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> id="rsc-options-resource-stickiness" name="resource-stickiness"
> value="100" />
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> </meta_attributes>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> </rsc_defaults>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> </configuration>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cib>
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> Operation complete: op cib_replace for section 'all'
> (origin=local/cibadmin/2, version=0.6.1): ok (rc=0)
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> Operation complete: op cib_modify for section nodes
> (origin=local/crmd/68, version=0.6.2): ok (rc=0)
> > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> Shutdown escalation occurs after: 1200000ms
> > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> Checking for expired actions every 900000ms
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c
> > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> Sending expected-votes=2 to corosync
> > Sep 21 13:41:04 node1 crmd: [11503]: info: ais_dispatch_message:
> Membership 208: quorum retained
> > Sep 21 13:41:04 node1 crmd: [11503]: info: crmd_ais_dispatch: Setting
> expected votes to 2
> > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> Operation complete: op cib_modify for section crm_config
> (origin=local/crmd/73, version=0.6.5): ok (rc=0)
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80 81
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80 81
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80 81
> > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80 81 82 83 84
> > [...the same Retransmit List line repeats 39 more times between
> 13:41:04 and 13:41:09; snipped...]
> > Sep 21 13:41:09 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> 7b 7c 7d 7e 7f 80 81 82 83 84
> > Sep 21 13:41:17 node1 corosync[11485]:  [TOTEM ] FAILED TO RECEIVE
> > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch:
> Receiving message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: Receiving
> message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: Receiving
> message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: Receiving
> message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: AIS
> connection failed
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: AIS
> connection failed
> > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: AIS
> connection failed
> > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: AIS
> connection failed
> > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR:
> stonith_peer_ais_destroy: AIS connection terminated
> > Sep 21 13:41:26 node1 crmd: [11503]: info: crmd_ais_destroy:
> connection closed
> > Sep 21 13:41:26 node1 attrd: [11501]: CRIT: attrd_ais_destroy: Lost
> connection to OpenAIS service!
> > Sep 21 13:41:26 node1 cib: [11499]: ERROR: cib_ais_destroy: AIS
> connection terminated
> > Sep 21 13:41:26 node1 crmd: [11503]: info: cib_native_msgready: Lost
> connection to the CIB service [11499].
> > Sep 21 13:41:26 node1 attrd: [11501]: notice: main: Exiting...
> > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost
> connection to the CIB service [11499/callback].
> > Sep 21 13:41:26 node1 attrd: [11501]: ERROR:
> attrd_cib_connection_destroy: Connection to the CIB terminated...
> > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost
> connection to the CIB service [11499/command].
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR:
> crmd_cib_connection_destroy: Connection to the CIB terminated...
> > Sep 21 13:41:26 node1 crmd: [11503]: info: stonith_msgready: Lost
> connection to the STONITH service [11498].
> > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost
> connection to the STONITH service [11498/callback].
> > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost
> connection to the STONITH service [11498/command].
> > Sep 21 13:41:26 node1 crmd: [11503]: CRIT:
> tengine_stonith_connection_destroy: Fencing daemon connection failed
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input
> I_ERROR from crmd_cib_connection_destroy() received in state S_ELECTION
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State
> transition S_ELECTION -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_recover: Action
> A_RECOVER (0000000001000000) not supported
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_dc_release: DC role
> released
> > Sep 21 13:41:26 node1 crmd: [11503]: info: pe_connection_destroy:
> Connection to the Policy Engine released
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_te_control:
> Transitioner is now inactive
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State
> transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE
> cause=C_FSA_INTERNAL origin=do_recover ]
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_shutdown: Disconnecting
> STONITH...
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_lrm_control:
> Disconnected from the LRM
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_ha_control:
> Disconnected from OpenAIS
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_cib_control:
> Disconnecting CIB
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: Performing
> A_EXIT_0 - gracefully exiting the CRMd
> > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_exit: Could not
> recover from internal error
> > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping
> I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL
> origin=do_dc_release ]
> > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping
> I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > Sep 21 13:41:26 node1 crmd: [11503]: info: crm_xml_cleanup: Cleaning
> up memory from libxml2
> > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: [crmd] stopped
> (2)
> >
> > This is the configuration, nothing fancy here:
> >
> > node node1
> > node node2
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
> >     cluster-infrastructure="openais" \
> >     expected-quorum-votes="2" \
> >     stonith-enabled="false" \
> >     no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
> >
> > I am running corosync 1.3.1, openais 1.1.4 and pacemaker 1.1.5.
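Side note, offered as a hedged workaround rather than a diagnosis: multicast delivery across virtualized bridges is often unreliable, and corosync 1.3.x also supports a UDP-unicast transport that avoids multicast entirely. A sketch of the relevant corosync.conf totem section follows; all addresses are placeholders for your ring-0 network and node IPs.

```
totem {
    version: 2
    transport: udpu              # UDP unicast instead of multicast
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # placeholder: your ring-0 network
        mcastport: 5405
        member {
            memberaddr: 192.168.1.1  # placeholder: node1
        }
        member {
            memberaddr: 192.168.1.2  # placeholder: node2
        }
    }
}
```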
> >
> > I see this a lot on another cluster on this XEN host, too.
> >
> > What exactly does "[TOTEM ] FAILED TO RECEIVE" mean, and why does the
> cluster become so unstable afterwards?
> >
> > Thank you,
> > Sascha
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
