Hi,

On Wed, Sep 21, 2011 at 02:05:54PM +0200, Sascha Hagedorn wrote:
> Hi everyone,
> 
> I am experiencing strange problems on a two-node cluster. The nodes are 
> virtual machines on a XEN 5.6 SP2 host. I have set up a simple cluster 
> with no resources defined. After I set up the configuration and commit 
> it to the cluster, the message [TOTEM ] FAILED TO RECEIVE comes up, and 
> shortly after that the AIS connection is lost and I cannot access the 
> crm anymore.

It seems like corosync crashed: all the Pacemaker daemons lost their
AIS connection at the same moment, which is what your log shows. As
for the message itself, "FAILED TO RECEIVE" means the node kept
missing messages on the totem ring even after repeated retransmits
(that is what the growing "Retransmit List" is); on virtualized hosts
that usually points at unreliable multicast delivery between the
guests. You should post to the corosync ML,
http://lists.corosync.org/mailman/listinfo, if possible with the
backtrace.
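
In case it helps, here is a rough sketch of how I would capture that
backtrace, assuming core dumps are enabled and corosync keeps
/var/lib/corosync as its working directory; the init script name and
the core file path are guesses for your distro, and depending on the
init system you may need to set the limit in the init script itself:

    # Enable core dumps before reproducing the crash.
    ulimit -c unlimited
    /etc/init.d/corosync restart

    # After the crash, look for the core file and dump a full
    # backtrace; install the corosync debuginfo package first so
    # the symbols resolve.
    ls /var/lib/corosync/core.*
    gdb -batch -ex 'thread apply all bt full' \
        /usr/sbin/corosync /var/lib/corosync/core.<pid> > corosync-bt.txt

Independent of the crash, you can give totem more slack while you
check the multicast path between the two guests. This is only a
sketch with illustrative values, not tuned numbers; merge it into
the existing totem section of /etc/corosync/corosync.conf on both
nodes and restart:

    totem {
            version: 2
            # Wait longer before declaring the token lost
            # (the default is 1000 ms).
            token: 5000
            # Allow more retransmit attempts before giving up
            # (the default is 4).
            token_retransmits_before_loss_const: 10
    }

Note that tuning only papers over the message loss; FAILED TO
RECEIVE should go away for good once multicast works reliably on
the host.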

Thanks,

Dejan

> Here is the output:
> 
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_replace_notify: Replaced: 
> 0.5.23 -> 0.6.1 from <null>
> Sep 21 13:41:04 node1 crmd: [11503]: info: abort_transition_graph: 
> te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, 
> id=(null), magic=NA, cib=0.6.1) : Non-status change
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff- <cib admin_epoch="0" 
> epoch="5" num_updates="23" />
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State 
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Sep 21 13:41:04 node1 attrd: [11501]: info: do_cib_replaced: Sending full 
> refresh
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ <cib epoch="6" 
> num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" 
> crm_feature_set="3.0.5" have-quorum="1" dc-uuid="node1" >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: All 2 cluster 
> nodes are eligible to run resources.
> Sep 21 13:41:04 node1 attrd: [11501]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: probe_complete (true)
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   <configuration >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_pe_invoke: Query 70: Requesting 
> the current CIB: S_POLICY_ENGINE
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <crm_config >
> Sep 21 13:41:04 node1 crmd: [11503]: info: do_state_transition: State 
> transition S_POLICY_ENGINE -> S_ELECTION [ input=I_ELECTION 
> cause=C_FSA_INTERNAL origin=do_cib_replaced ]
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       
> <cluster_property_set id="cib-bootstrap-options" >
> Sep 21 13:41:04 node1 crmd: [11503]: info: update_dc: Unset DC node1
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair 
> id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" 
> value="false" __crm_diff_marker__="added:top" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair 
> id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" 
> value="ignore" __crm_diff_marker__="added:top" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       
> </cluster_property_set>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </crm_config>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     <rsc_defaults 
> __crm_diff_marker__="added:top" >
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       <meta_attributes 
> id="rsc-options" >
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+         <nvpair 
> id="rsc-options-resource-stickiness" name="resource-stickiness" value="100" />
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+       </meta_attributes>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+     </rsc_defaults>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+   </configuration>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib:diff+ </cib>
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation 
> complete: op cib_replace for section 'all' (origin=local/cibadmin/2, 
> version=0.6.1): ok (rc=0)
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation 
> complete: op cib_modify for section nodes (origin=local/crmd/68, 
> version=0.6.2): ok (rc=0)
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Shutdown 
> escalation occurs after: 1200000ms
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Checking 
> for expired actions every 900000ms
> Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a 7b 7c 
> Sep 21 13:41:04 node1 crmd: [11503]: info: config_query_callback: Sending 
> expected-votes=2 to corosync
> Sep 21 13:41:04 node1 crmd: [11503]: info: ais_dispatch_message: Membership 
> 208: quorum retained
> Sep 21 13:41:04 node1 crmd: [11503]: info: crmd_ais_dispatch: Setting 
> expected votes to 2
> Sep 21 13:41:04 node1 cib: [11499]: info: cib_process_request: Operation 
> complete: op cib_modify for section crm_config (origin=local/crmd/73, 
> version=0.6.5): ok (rc=0)
> Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a 7b 7c 7d 
> 7e 
> Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a 7b 7c 7d 
> 7e 7f 80 
> Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a 7b 7c 7d 
> 7e 7f 80 81 
> Sep 21 13:41:04 node1 corosync[11485]:  [TOTEM ] Retransmit List: 7a 7b 7c 7d 
> 7e 7f 80 81 82 83 84 
> [the same retransmit list then repeats roughly 40 more times, from 
> 13:41:04 through 13:41:09]
> Sep 21 13:41:17 node1 corosync[11485]:  [TOTEM ] FAILED TO RECEIVE
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: Receiving 
> message body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: Receiving message 
> body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: Receiving message 
> body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: Receiving message 
> body failed: (2) Library error: Resource temporarily unavailable (11)
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: ais_dispatch: AIS 
> connection failed
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: ais_dispatch: AIS connection 
> failed
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: ais_dispatch: AIS connection 
> failed
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: ais_dispatch: AIS connection failed
> Sep 21 13:41:26 node1 stonith-ng: [11498]: ERROR: stonith_peer_ais_destroy: 
> AIS connection terminated
> Sep 21 13:41:26 node1 crmd: [11503]: info: crmd_ais_destroy: connection closed
> Sep 21 13:41:26 node1 attrd: [11501]: CRIT: attrd_ais_destroy: Lost 
> connection to OpenAIS service!
> Sep 21 13:41:26 node1 cib: [11499]: ERROR: cib_ais_destroy: AIS connection 
> terminated
> Sep 21 13:41:26 node1 crmd: [11503]: info: cib_native_msgready: Lost 
> connection to the CIB service [11499].
> Sep 21 13:41:26 node1 attrd: [11501]: notice: main: Exiting...
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost 
> connection to the CIB service [11499/callback].
> Sep 21 13:41:26 node1 attrd: [11501]: ERROR: attrd_cib_connection_destroy: 
> Connection to the CIB terminated...
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: cib_native_dispatch: Lost 
> connection to the CIB service [11499/command].
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: crmd_cib_connection_destroy: 
> Connection to the CIB terminated...
> Sep 21 13:41:26 node1 crmd: [11503]: info: stonith_msgready: Lost connection 
> to the STONITH service [11498].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection 
> to the STONITH service [11498/callback].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: stonith_dispatch: Lost connection 
> to the STONITH service [11498/command].
> Sep 21 13:41:26 node1 crmd: [11503]: CRIT: 
> tengine_stonith_connection_destroy: Fencing daemon connection failed
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_ERROR from 
> crmd_cib_connection_destroy() received in state S_ELECTION
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State 
> transition S_ELECTION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
> origin=crmd_cib_connection_destroy ]
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_recover: Action A_RECOVER 
> (0000000001000000) not supported
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_dc_release: DC role released
> Sep 21 13:41:26 node1 crmd: [11503]: info: pe_connection_destroy: Connection 
> to the Policy Engine released
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_te_control: Transitioner is now 
> inactive
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_log: FSA: Input I_TERMINATE 
> from do_recover() received in state S_RECOVERY
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_state_transition: State 
> transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL 
> origin=do_recover ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_shutdown: Disconnecting 
> STONITH...
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_lrm_control: Disconnected from 
> the LRM
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_ha_control: Disconnected from 
> OpenAIS
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_cib_control: Disconnecting CIB
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: Performing A_EXIT_0 - 
> gracefully exiting the CRMd
> Sep 21 13:41:26 node1 crmd: [11503]: ERROR: do_exit: Could not recover from 
> internal error
> Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping 
> I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL 
> origin=do_dc_release ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: free_mem: Dropping I_TERMINATE: [ 
> state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Sep 21 13:41:26 node1 crmd: [11503]: info: crm_xml_cleanup: Cleaning up 
> memory from libxml2
> Sep 21 13:41:26 node1 crmd: [11503]: info: do_exit: [crmd] stopped (2)
> 
> This is the configuration, nothing fancy here:
> 
> node node1
> node node2
> property $id="cib-bootstrap-options" \
>       dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>       cluster-infrastructure="openais" \
>       expected-quorum-votes="2" \
>       stonith-enabled="false" \
>       no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>       resource-stickiness="100"
> 
> I am running corosync 1.3.1, openais 1.1.4 and pacemaker 1.1.5.
> 
> I see this a lot on a different cluster on the same XEN host, too.
> 
> What exactly does "[TOTEM ] FAILED TO RECEIVE" mean, and why does the 
> cluster become so unstable after that?
> 
> Thank you,
> Sascha
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
