I have a problem with heartbeat dying. I have a 3-node cluster running HA 2.0.8 on Fedora Core 1. The nodes provide a single IP address resource and use eth0 as the heartbeat medium. If I disconnect the eth0 cable from the node that is currently providing the IP address, one of the other nodes correctly takes it over. However, shortly after the cable is disconnected, the heartbeat process (and others) die. The key section of the ha-debug log looks like this:

pengine[4293]: 2008/01/11_09:50:22 info: determine_online_status: Node loneranger.us.big.net is online
pengine[4293]: 2008/01/11_09:50:22 info: native_print: SharedIP (heartbeat::ocf:IPaddr): Started loneranger.us.big.net
pengine[4293]: 2008/01/11_09:50:22 notice: StopRsc: loneranger.us.big.net Stop SharedIP
crmd[9543]: 2008/01/11_09:50:22 info: do_state_transition: loneranger.us.big.net: State transition S_POLICY_ENGINE ->S_TRANSITION_ENGINE [input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
pengine[4293]: 2008/01/11_09:50:22 info: process_pe_message: Transition 0: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-137.bz2
tengine[4292]: 2008/01/11_09:50:22 info: unpack_graph: Unpacked transition 0: 1 actions in 1 synapses
tengine[4292]: 2008/01/11_09:50:22 info: send_rsc_command: Initiating action 3: SharedIP_stop_0 on loneranger.us.big.net
crmd[9543]: 2008/01/11_09:50:22 info: do_lrm_rsc_op: Performing op=SharedIP_stop_0 key=3:0:994066a9-4cae-49a4-abad-37f3e0b84b3e)
IPaddr[4300]: 2008/01/11_09:50:22 INFO: /sbin/ifconfig eth0:0 10.1.2.50 down
lrmd[9540]: 2008/01/11_09:50:22 info: RA output: (SharedIP:stop:stderr) SIOCDELRT: No such process
crmd[9543]: 2008/01/11_09:50:22 info: process_lrm_event: LRM operation SharedIP_stop_0 (call=4, rc=0) complete
cib[9539]: 2008/01/11_09:50:22 info: cib_diff_notify: Update (client: 9543, call:32): 0.30.317 -> 0.30.318 (ok)
cib[4315]: 2008/01/11_09:50:22 info: write_cib_contents: Wrote version 0.30.318 of the CIB to disk (digest: ad7329b3cddc6a9bbd96deb332a3d08f)
tengine[4292]: 2008/01/11_09:50:22 info: te_update_diff: Processing diff (cib_update): 0.30.317 -> 0.30.318
tengine[4292]: 2008/01/11_09:50:22 info: match_graph_event: Action SharedIP_stop_0 (3) confirmed on c8608d41-66b2-4115-9043-4a8423b0d562
tengine[4292]: 2008/01/11_09:50:22 info: run_graph: Transition 0: (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[4292]: 2008/01/11_09:50:22 info: notify_crmd: Transition 0 status: te_complete - <null>
crmd[9543]: 2008/01/11_09:50:22 info: do_state_transition: loneranger.us.big.net: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Shutting down.
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Shutting down.
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
heartbeat[9527]: 2008/01/11_09:54:27 ERROR: Shutting down.

The last two messages repeat for a very long time, and then most of the daemons eventually stop.
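
For anyone reading along: as far as I can tell, "Resource temporarily unavailable" is the strerror text for EAGAIN on Linux, which is what a non-blocking write returns when a pipe's buffer is full. The sketch below is not heartbeat code; it only reproduces that condition on an ordinary pipe whose reader never drains it. That heartbeat's media pipe is non-blocking, and that its reader stalls once eth0 is unplugged, are assumptions on my part, and fill_nonblocking_pipe is a name made up for this illustration.

```python
import errno
import os


def fill_nonblocking_pipe():
    """Write to a non-blocking pipe that nobody reads until the kernel refuses.

    Returns (bytes_written, errno_of_the_failed_write).
    Hypothetical helper for illustration only; not part of heartbeat.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # writes now fail instead of blocking
    written = 0
    try:
        while True:
            # The read end is never drained, so the pipe buffer eventually fills.
            written += os.write(write_fd, b"x" * 4096)
    except BlockingIOError as exc:
        # Raised for EAGAIN/EWOULDBLOCK once the buffer is full.
        return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    written, err = fill_nonblocking_pipe()
    print(f"pipe full after {written} bytes: {os.strerror(err)}")
```

If this is indeed what is happening, it would suggest that whatever reads from media pipe 0 has stopped consuming data after the cable is pulled, but I have not confirmed that.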


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
