Hi,

On Wed, Feb 27, 2008 at 11:45:10AM -0500, Tao Yu wrote:
> Running heartbeat 2.1.2 on CentOS 4.5.
> The cluster has just two nodes, boxqaha1 and boxqaha2.
> 
> We left the HA resources running on boxqaha1 for some time and found that
> boxqaha2 took over the resources for no reason. Checking heartbeat on
> boxqaha1 showed that only lrmd was still running there.
> 
> The following is the section of the log file from when the error happened.
> 
> cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER:
> boxqaha1.cybervisiontech.com.ua
> cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations (
> 100000.00us average, 0% utilization) in the last 10min
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
> function for send local status was delayed 14130 ms (> 510 ms) before being
> called (GSource: 0x649118)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
> at 654275520 should have started at 654274107
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node
> boxqaha1.cybervisiontech.com.ua: interval 15130 ms
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node
> boxqaha2.cybervisiontech.com.ua: interval 15280 ms
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
> function for check for signals was delayed 14130 ms (> 510 ms) before being
> called (GSource: 0x649f88)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
> at 654275520 should have started at 654274107
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
> function for update msgfree count was delayed 14270 ms (> 5000 ms) before
> being called (GSource: 0x64a0b8)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
> at 654275520 should have started at 654274093
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch
> function for client audit was delayed 11860 ms (> 5000 ms) before being
> called (GSource: 0x649bb8)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started
> at 654275520 should have started at 654274334
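
These Gmain_timeout_dispatch warnings mean the GLib main loop called
heartbeat's timers much later than scheduled, which points to a host
starved of CPU or stuck in I/O rather than a network problem. The same
lateness check can be sketched in plain GLib (a hypothetical demo, not
heartbeat's actual wrapper; the 500 ms interval and the 2 s stall are
made up):

/* timer_delay.c: detect late timer dispatch in a GLib main loop.
 * Build: gcc timer_delay.c $(pkg-config --cflags --libs glib-2.0)
 * Hypothetical demo, not heartbeat's Gmain_timeout_dispatch. */
#include <glib.h>

#define INTERVAL_MS 500

static GMainLoop *loop;
static gint64 expected_us;          /* when the next tick should fire */

/* Simulate an overloaded host: one callback that stalls the loop. */
static gboolean hog(gpointer data)
{
    (void)data;
    g_usleep(2 * G_USEC_PER_SEC);   /* block the main loop for 2 s */
    return FALSE;                   /* run once */
}

static gboolean tick(gpointer data)
{
    gint64 now = g_get_monotonic_time();
    gint64 late_ms = (now - expected_us) / 1000;

    (void)data;
    if (late_ms > INTERVAL_MS) {    /* same idea as the "> 510 ms" check */
        g_warning("tick delayed %" G_GINT64_FORMAT
                  " ms before being called", late_ms);
        g_main_loop_quit(loop);
        return FALSE;
    }
    expected_us = now + INTERVAL_MS * 1000;
    return TRUE;                    /* keep firing */
}

int main(void)
{
    loop = g_main_loop_new(NULL, FALSE);
    expected_us = g_get_monotonic_time() + INTERVAL_MS * 1000;
    g_timeout_add(INTERVAL_MS, tick, NULL);
    g_timeout_add(1000, hog, NULL); /* trigger one late dispatch */
    g_main_loop_run(loop);
    g_main_loop_unref(loop);
    return 0;
}

Once the hog stalls the loop, the next tick logs a delay of roughly
1500-2000 ms, just like the warnings above.
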
> heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [
> boxqaha2.cybervisiontech.com.ua] [142807:143221]
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
> Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
> Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
>       ... ... ...
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0:
> Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
> heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
> heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194,
> hist->hiseq=142300
> heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from
> boxqaha2.cybervisiontech.com.ua
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
>    ... ... ...
> heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from
> boxqaha2.cybervisiontech.com.ua
> heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
> heartbeat[26913]: 2008/02/27_07:46:38 debug:
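
The hist->ackseq/lowseq/hiseq lines describe heartbeat's retransmission
window: packets up to ackseq are acknowledged, hiseq is the highest
sequence number seen, and unseen numbers in between are reported as
missing. Roughly this bookkeeping, as a toy sketch (the real accounting
keeps more state, e.g. lowseq, and differs in detail):

/* seq_window.c: toy lost-packet accounting in the spirit of heartbeat's
 * hist->ackseq/hiseq window. Hypothetical sketch, not heartbeat's code. */
#include <stdio.h>
#include <stdbool.h>

#define WINDOW 4096                 /* hypothetical history ring size */

static bool seen[WINDOW];           /* seen[seq % WINDOW]: pkt arrived */

int main(void)
{
    unsigned long ackseq = 1000;    /* everything <= ackseq acknowledged */
    unsigned long hiseq  = 1000;    /* highest sequence number observed  */
    unsigned long incoming[] = { 1001, 1002, 1005, 1006 }; /* 1003-4 lost */
    unsigned long seq;
    size_t i;
    int n = 0;

    for (i = 0; i < sizeof incoming / sizeof *incoming; i++) {
        seen[incoming[i] % WINDOW] = true;
        if (incoming[i] > hiseq)
            hiseq = incoming[i];
    }

    /* Advance ackseq over the contiguous prefix of received packets. */
    while (ackseq < hiseq && seen[(ackseq + 1) % WINDOW])
        ackseq++;

    /* Anything between ackseq and hiseq that never arrived is missing. */
    for (seq = ackseq + 1; seq <= hiseq; seq++)
        if (!seen[seq % WINDOW])
            printf("%d: missing pkt: %lu\n", n++, seq);

    printf("hist->ackseq=%lu, hist->hiseq=%lu\n", ackseq, hiseq);
    return 0;
}
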
> heartbeat[26913]: 2008/02/27_07:46:38 info: killing
> /usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/ccm" (501,502) pid 26921
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/cib" (501,502) pid 26922
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/attrd" (501,502) pid 26925
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/crmd" (501,502) pid 26926
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client
> "/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
> mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master
> Control process died.
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead):
> Killing ourselves.
> mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated...
> exiting
> ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service.
> Need to bail out.
> stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat
> daemon
> attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection
> to heartbeat service.
> cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat
> connection lost!  Exiting.
> crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection
> to heartbeat service.
> stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd
> normally quit.
> attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost
> connection to heartbeat service!
> cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal
> assert at main.c:213 : g_hash_table_size(client_list) == 0
> crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is
> dead before the client!
> attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
> cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
> crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears
> to have failed: rc=-1.
> attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug:
> 4294967393 bytes in 128 byte bucket #2
> cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug:
> 4294967391 bytes in 128 byte bucket #2
> crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from
> ccm_dispatch() received in state (S_NOT_DC)
> crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition
> S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK
> origin=ccm_dispatch ]
> crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER
> (0000000001000000) not supported
> crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from
> do_recover() received in state (S_RECOVERY)
> crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition
> S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL
> origin=do_recover ]
> crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
> crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input
> I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition
> S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL
> origin=do_shutdown ]
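
The do_state_transition lines show the crmd's finite state machine
reacting to the dead CCM connection. The chain from the log can be
mimicked with a toy FSA (a hypothetical sketch; the real FSA has many
more states, inputs and actions):

/* fsa_states.c: toy version of the crmd transitions seen above.
 * Hypothetical sketch, not the actual crmd FSA. */
#include <stdio.h>

typedef enum { S_NOT_DC, S_RECOVERY, S_STOPPING, S_TERMINATE } fsa_state_t;
typedef enum { I_ERROR, I_STOP, I_TERMINATE } fsa_input_t;

static const char *state_name[] = {
    "S_NOT_DC", "S_RECOVERY", "S_STOPPING", "S_TERMINATE"
};

/* Apply one input, logging like do_state_transition does. */
static fsa_state_t transition(fsa_state_t s, fsa_input_t in)
{
    fsa_state_t next = s;

    switch (in) {
    case I_ERROR:     next = S_RECOVERY;  break;
    case I_STOP:      next = S_STOPPING;  break;
    case I_TERMINATE: next = S_TERMINATE; break;
    }
    printf("State transition %s -> %s\n", state_name[s], state_name[next]);
    return next;
}

int main(void)
{
    fsa_state_t s = S_NOT_DC;

    /* The same chain the log shows once the CCM connection fails. */
    s = transition(s, I_ERROR);
    s = transition(s, I_STOP);
    s = transition(s, I_TERMINATE);
    return 0;
}
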
> crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active
> resources before exit
> crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active
> resources before exit
> crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the
> LRM
> crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from
> Heartbeat
> crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
> crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to
> 26922 is not connected
> crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #=========
> IPC[outbound] message start ==========#
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
> crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending
> message to CIB service FAILED
> crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy:
> Connection to the CIB terminated...
> crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 -
> gracefully exiting the CRMd
> crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from
> internal error
> crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [
> state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [
> state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
> 
> 
> "ERROR: Cannot write to media pipe 0: Resource temporarily unavailable"
> looks caused all this. What is this? the IP stack was messed up? But why
> heartbeat dies here?

The host was simply overwhelmed and couldn't keep up with the
communication. There is a known bug in 2.1.2 which is exercised when
there are many frequent monitor operations. Please upgrade to 2.1.3.
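
For what it's worth, "Resource temporarily unavailable" is the standard
text for EAGAIN, which a non-blocking write() returns when the pipe is
full and nothing is draining it; on an overwhelmed host the reader side
simply never got scheduled. The failure mode is easy to reproduce (a
hypothetical demo, not heartbeat's media layer):

/* eagain_pipe.c: fill a non-blocking pipe until write() fails with
 * EAGAIN, reproducing "Resource temporarily unavailable". */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096] = { 0 };

    if (pipe(fds) < 0) {
        perror("pipe");
        return EXIT_FAILURE;
    }

    /* Non-blocking writer, like a daemon that must not stall. */
    if (fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0) {
        perror("fcntl");
        return EXIT_FAILURE;
    }

    /* Nobody reads fds[0], so the pipe buffer eventually fills up. */
    for (;;) {
        if (write(fds[1], buf, sizeof buf) < 0) {
            if (errno == EAGAIN) {
                /* The error heartbeat logged for media pipe 0. */
                fprintf(stderr, "Cannot write to media pipe 0: %s\n",
                        strerror(errno));
                return EXIT_SUCCESS;
            }
            perror("write");
            return EXIT_FAILURE;
        }
    }
}
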

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
