Hi,

On Thu, Sep 22, 2011 at 09:23:11AM +0200, Sascha Hagedorn wrote:
> Hi Dejan,
> 
> thank you for your quick response. I will post this on the corosync mailing
> list as well. But is this "TOTEM Failed to receive" message an indicator that
> the multicast communication between the two nodes is somehow broken?

IIRC, yes. You should probably check how your switch is handling
multicast.
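
A quick way to verify multicast between the nodes is omping (use the
multicast address and port from your corosync.conf; the values below
are only examples):

  # run simultaneously on both nodes; each should report multicast
  # responses from the other
  omping -m 239.255.1.1 -p 5405 node1 node2

If multicast turns out to be unreliable (IGMP snooping without an
active querier on the switch is a common culprit) and you cannot fix
it there, corosync 1.3 can also run over UDP unicast. A rough sketch
of the relevant corosync.conf part, with placeholder addresses for
your two nodes:

  totem {
          version: 2
          transport: udpu
          interface {
                  ringnumber: 0
                  bindnetaddr: 192.168.1.0
                  mcastport: 5405
                  member {
                          memberaddr: 192.168.1.1
                  }
                  member {
                          memberaddr: 192.168.1.2
                  }
          }
  }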

Thanks,

Dejan

> Regards,
> Sascha
> 
> > -----Original Message-----
> > From: [email protected] [mailto:linux-ha-
> > [email protected]] On Behalf Of Dejan Muhamedagic
> > Sent: Wednesday, September 21, 2011 14:19
> > To: General Linux-HA mailing list
> > Subject: Re: [Linux-HA] crm crashes after: TOTEM failed to receive
> >
> > Hi,
> >
> > On Wed, Sep 21, 2011 at 02:05:54PM +0200, Sascha Hagedorn wrote:
> > > Hi everyone,
> > >
> > > I am experiencing strange problems on a two-node cluster. The nodes
> > are virtual machines on a XEN 5.6 SP2 host. I have set up a simple
> > cluster with no resources defined. After I set up the configuration and
> > commit it to the cluster, the message [TOTEM ] FAILED TO RECEIVE comes
> > up, and shortly after that the AIS connection is lost and I cannot
> > access the crm shell anymore.
> >
> > It seems like corosync crashed. You should post to the corosync
> > ML, http://lists.corosync.org/mailman/listinfo, if possible with
> > the backtrace.
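> >
> > If corosync left a core dump, a backtrace can be pulled from it with
> > gdb along these lines (the core file name and location here are only
> > a guess; where cores end up depends on your distribution's settings):
> >
> >   # allow core dumps before reproducing the crash
> >   ulimit -c unlimited
> >   # after the crash, load the core file into gdb
> >   gdb /usr/sbin/corosync core.11485
> >   (gdb) thread apply all bt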
> >
> > Thanks,
> >
> > Dejan
> >
> > > Here is the output:
> > >
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_replace_notify:
> > Replaced: 0.5.23 -> 0.6.1 from <null>
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: abort_transition_graph:
> > te_update_diff:124 - Triggered transition abort (complete=1, tag=diff,
> > id=(null), magic=NA, cib=0.6.1) : Non-status change
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff- <cib
> > admin_epoch="0" epoch="5" num_updates="23" />
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> > cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > > Sep 21 13:41:04 node1 attrd: [11501]: info: do_cib_replaced: Sending
> > full refresh
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cib epoch="6"
> > num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
> > crm_feature_set="3.0.5" have-quorum="1" dc-uuid="node1" >
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: All 2
> > cluster nodes are eligible to run resources.
> > > Sep 21 13:41:04 node1 attrd: [11501]: info: attrd_trigger_update:
> > Sending flush op to all hosts for: probe_complete (true)
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   <configuration
> > >
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_pe_invoke: Query 70:
> > Requesting the current CIB: S_POLICY_ENGINE
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <crm_config >
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State
> > transition S_POLICY_ENGINE -> S_ELECTION [ input=I_ELECTION
> > cause=C_FSA_INTERNAL origin=do_cib_replaced ]
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > <cluster_property_set id="cib-bootstrap-options" >
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: update_dc: Unset DC node1
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> > id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled"
> > value="false" __crm_diff_marker__="added:top" />
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> > id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy"
> > value="ignore" __crm_diff_marker__="added:top" />
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > </cluster_property_set>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </crm_config>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <rsc_defaults
> > __crm_diff_marker__="added:top" >
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > <meta_attributes id="rsc-options" >
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair
> > id="rsc-options-resource-stickiness" name="resource-stickiness"
> > value="100" />
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > </meta_attributes>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > </rsc_defaults>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+
> > </configuration>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cib>
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> > Operation complete: op cib_replace for section 'all'
> > (origin=local/cibadmin/2, version=0.6.1): ok (rc=0)
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> > Operation complete: op cib_modify for section nodes
> > (origin=local/crmd/68, version=0.6.2): ok (rc=0)
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> > Shutdown escalation occurs after: 1200000ms
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> > Checking for expired actions every 900000ms
> > > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> > 7b 7c
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback:
> > Sending expected-votes=2 to corosync
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: ais_dispatch_message:
> > Membership 208: quorum retained
> > > Sep 21 13:41:04 node1 crmd: [11503]: info: crmd_ais_dispatch: Setting
> > expected votes to 2
> > > Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request:
> > Operation complete: op cib_modify for section crm_config
> > (origin=local/crmd/73, version=0.6.5): ok (rc=0)
> > > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> > 7b 7c 7d 7e
> > > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> > 7b 7c 7d 7e 7f 80
> > > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> > 7b 7c 7d 7e 7f 80 81
> > > Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a
> > 7b 7c 7d 7e 7f 80 81 82 83 84
> > > [... identical "Retransmit List: 7a 7b 7c 7d 7e 7f 80 81 82 83 84"
> > lines repeated roughly 40 more times between 13:41:04 and 13:41:09 ...]
> > > Sep 21 13:41:17 node1 corosync[11485]:  [TOTEM ] FAILED TO RECEIVE
> > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch:
> > Receiving message body failed: (2) Library error: Resource temporarily
> > unavailable (11)
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: Receiving
> > message body failed: (2) Library error: Resource temporarily
> > unavailable (11)
> > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: Receiving
> > message body failed: (2) Library error: Resource temporarily
> > unavailable (11)
> > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: Receiving
> > message body failed: (2) Library error: Resource temporarily
> > unavailable (11)
> > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: AIS
> > connection failed
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: AIS
> > connection failed
> > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: AIS
> > connection failed
> > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: AIS
> > connection failed
> > > Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR:
> > stonith_peer_ais_destroy: AIS connection terminated
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: crmd_ais_destroy:
> > connection closed
> > > Sep 21 13:41:26 node1 attrd: [11501]: CRIT: attrd_ais_destroy: Lost
> > connection to OpenAIS service!
> > > Sep 21 13:41:26 node1 cib: [11499]: ERROR: cib_ais_destroy: AIS
> > connection terminated
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: cib_native_msgready: Lost
> > connection to the CIB service [11499].
> > > Sep 21 13:41:26 node1 attrd: [11501]: notice: main: Exiting...
> > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost
> > connection to the CIB service [11499/callback].
> > > Sep 21 13:41:26 node1 attrd: [11501]: ERROR:
> > attrd_cib_connection_destroy: Connection to the CIB terminated...
> > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost
> > connection to the CIB service [11499/command].
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR:
> > crmd_cib_connection_destroy: Connection to the CIB terminated...
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: stonith_msgready: Lost
> > connection to the STONITH service [11498].
> > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost
> > connection to the STONITH service [11498/callback].
> > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost
> > connection to the STONITH service [11498/command].
> > > Sep 21 13:41:26 node1 crmd: [11503]: CRIT:
> > tengine_stonith_connection_destroy: Fencing daemon connection failed
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input
> > I_ERROR from crmd_cib_connection_destroy() received in state S_ELECTION
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State
> > transition S_ELECTION -> S_RECOVERY [ input=I_ERROR
> > cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_recover: Action
> > A_RECOVER (0000000001000000) not supported
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_dc_release: DC role
> > released
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: pe_connection_destroy:
> > Connection to the Policy Engine released
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_te_control:
> > Transitioner is now inactive
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input
> > I_TERMINATE from do_recover() received in state S_RECOVERY
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State
> > transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE
> > cause=C_FSA_INTERNAL origin=do_recover ]
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_shutdown: Disconnecting
> > STONITH...
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_lrm_control:
> > Disconnected from the LRM
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_ha_control:
> > Disconnected from OpenAIS
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_cib_control:
> > Disconnecting CIB
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: Performing
> > A_EXIT_0 - gracefully exiting the CRMd
> > > Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_exit: Could not
> > recover from internal error
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping
> > I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL
> > origin=do_dc_release ]
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping
> > I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: crm_xml_cleanup: Cleaning
> > up memory from libxml2
> > > Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: [crmd] stopped
> > (2)
> > >
> > > This is the configuration, nothing fancy here:
> > >
> > > node node1
> > > node node2
> > > property $id="cib-bootstrap-options" \
> > >     dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
> > >     cluster-infrastructure="openais" \
> > >     expected-quorum-votes="2" \
> > >     stonith-enabled="false" \
> > >     no-quorum-policy="ignore"
> > > rsc_defaults $id="rsc-options" \
> > >     resource-stickiness="100"
> > >
> > > I am running corosync 1.3.1, openais 1.1.4 and pacemaker 1.1.5.
> > >
> > > I see this a lot too on a different cluster on this XEN host.
> > >
> > > What does "[TOTEM ] FAILED TO RECEIVE" exactly mean and why does the
> > cluster get so unstable after that?
> > >
> > > Thank you,
> > > Sascha
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
