We are running heartbeat 2.1.2 on CentOS 4.5. The cluster has just two
nodes, boxqaha1 and boxqaha2. The HA resources had been running on
boxqaha1 for some time when boxqaha2 took them over for no apparent
reason. Checking heartbeat on boxqaha1, we found that only lrmd was
still running there.
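
For context, the timing section of our ha.cf is essentially the stock
one; roughly like the following (the values shown here are illustrative,
not copied from the production file):

# ha.cf timing section -- illustrative values, not the production file
keepalive 2     # heartbeat interval, seconds
warntime 10     # warn about late heartbeats after this long
deadtime 30     # declare a node dead after this long
initdead 120    # extra allowance at startup
udpport 694
bcast eth0
node boxqaha1.cybervisiontech.com.ua
node boxqaha2.cybervisiontech.com.ua
crm on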
The following is the section of the log file from when the error happened:
cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER:
boxqaha1.cybervisiontech.com.ua
cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations (
100000.00us average, 0% utilization) in the last 10min
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
function for send local status was delayed 14130 ms (> 510 ms) before being
called (GSource: 0x649118)
heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
at 654275520 should have started at 654274107
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node
boxqaha1.cybervisiontech.com.ua: interval 15130 ms
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node
boxqaha2.cybervisiontech.com.ua: interval 15280 ms
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
function for check for signals was delayed 14130 ms (> 510 ms) before being
called (GSource: 0x649f88)
heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
at 654275520 should have started at 654274107
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
function for update msgfree count was delayed 14270 ms (> 5000 ms) before
being called (GSource: 0x64a0b8)
heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
at 654275520 should have started at 654274093
heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
function for client audit was delayed 11860 ms (> 5000 ms) before being
called (GSource: 0x649bb8)
heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
at 654275520 should have started at 654274334
heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [
boxqaha2.cybervisiontech.com.ua] [142807:143221]
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
Resource temporarily unavailable
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
Resource temporarily unavailable
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
... ... ...
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
Resource temporarily unavailable
heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194,
hist->hiseq=142300
heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from
boxqaha2.cybervisiontech.com.ua
heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
... ... ...
heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from
boxqaha2.cybervisiontech.com.ua
heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
heartbeat[26913]: 2008/02/27_07:46:38 debug:
heartbeat[26913]: 2008/02/27_07:46:38 info: killing
/usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/ccm" (501,502) pid 26921
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/cib" (501,502) pid 26922
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/attrd" (501,502) pid 26925
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/crmd" (501,502) pid 26926
heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
"/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master
Control process died.
heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead):
Killing ourselves.
mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated...
exiting
ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service.
Need to bail out.
stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat
daemon
attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection
to heartbeat service.
cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat
connection lost! Exiting.
crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection
to heartbeat service.
stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd
normally quit.
attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost
connection to heartbeat service!
cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal
assert at main.c:213 : g_hash_table_size(client_list) == 0
crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is
dead before the client!
attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears
to have failed: rc=-1.
attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug:
4294967393 bytes in 128 byte bucket #2
cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug:
4294967391 bytes in 128 byte bucket #2
crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from
ccm_dispatch() received in state (S_NOT_DC)
crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition
S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK
origin=ccm_dispatch ]
crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER
(0000000001000000) not supported
crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from
do_recover() received in state (S_RECOVERY)
crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition
S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL
origin=do_recover ]
crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input
I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition
S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL
origin=do_shutdown ]
crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active
resources before exit
crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active
resources before exit
crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the
LRM
crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from
Heartbeat
crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to
26922 is not connected
crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #=========
IPC[outbound] message start ==========#
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending
message to CIB service FAILED
crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy:
Connection to the CIB terminated...
crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 -
gracefully exiting the CRMd
crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from
internal error
crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [
state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [
state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
"ERROR: Cannot write to media pipe 0: Resource temporarily unavailable"
looks caused all this. What is this? the IP stack was messed up? But why
heartbeat dies here?
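
For what it's worth, "Resource temporarily unavailable" is the
strerror() text for EAGAIN, i.e. a non-blocking write that could not
complete, presumably because the pipe feeding heartbeat's media write
child was full. That connection to heartbeat's internals is my guess
from the error text, not something I have verified in the source. A
minimal C sketch (not heartbeat code) that reproduces the errno:

/* Sketch: reproduce the errno behind "Resource temporarily
 * unavailable". When a non-blocking pipe is full because the reader
 * has stopped draining it, write() fails with EAGAIN, whose
 * strerror() text on Linux is exactly that message. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096];

    if (pipe(fds) < 0)
        return 1;
    /* Make the write end non-blocking, as I assume heartbeat does
     * for its media pipe (assumption, not verified). */
    fcntl(fds[1], F_SETFL, O_NONBLOCK);

    memset(buf, 0, sizeof(buf));
    for (;;) {
        /* Nobody reads fds[0], so the pipe buffer eventually fills. */
        if (write(fds[1], buf, sizeof(buf)) < 0) {
            /* Prints "Resource temporarily unavailable" on Linux. */
            printf("write failed: %s\n", strerror(errno));
            break;
        }
    }
    return 0;
}

If that is what happened here, it would suggest the write side backed
up because the process was starved for roughly 14 seconds (consistent
with the Gmain_timeout_dispatch delays above) rather than the IP stack
itself being broken, but that is only my reading of the log.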