Hi,

On Wed, Feb 27, 2008 at 11:45:10AM -0500, Tao Yu wrote:
> Running heartbeat 2.1.2 on CentOS 4.5.
> The cluster has just two nodes, boxqaha1 and boxqaha2.
>
> I left the HA resources running on boxqaha1 for some time and found that
> boxqaha2 had taken over the resources for no reason. When I checked
> heartbeat on boxqaha1, only lrmd was still running.
>
> The following is the section of the log file from when the error happened:
>
> cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER: boxqaha1.cybervisiontech.com.ua
> cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations (100000.00us average, 0% utilization) in the last 10min
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649118)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha1.cybervisiontech.com.ua: interval 15130 ms
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha2.cybervisiontech.com.ua: interval 15280 ms
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649f88)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 14270 ms (> 5000 ms) before being called (GSource: 0x64a0b8)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274093
> heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 11860 ms (> 5000 ms) before being called (GSource: 0x649bb8)
> heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274334
> heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [boxqaha2.cybervisiontech.com.ua] [142807:143221]
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> ... ... ...
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
> heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
> heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194, hist->hiseq=142300
> heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from boxqaha2.cybervisiontech.com.ua
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
> heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
> ... ... ...
> heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from boxqaha2.cybervisiontech.com.ua
> heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
> heartbeat[26913]: 2008/02/27_07:46:38 debug:
> heartbeat[26913]: 2008/02/27_07:46:38 info: killing /usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/ccm" (501,502) pid 26921
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/cib" (501,502) pid 26922
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/attrd" (501,502) pid 26925
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/crmd" (501,502) pid 26926
> heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
> mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master Control process died.
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
> heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated... exiting
> ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service. Need to bail out.
> stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat daemon
> attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection to heartbeat service.
> cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat connection lost! Exiting.
> crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection to heartbeat service.
> stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd normally quit.
> attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
> cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal assert at main.c:213 : g_hash_table_size(client_list) == 0
> crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is dead before the client!
> attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
> cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
> crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears to have failed: rc=-1.
> attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967393 bytes in 128 byte bucket #2
> cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967391 bytes in 128 byte bucket #2
> crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from ccm_dispatch() received in state (S_NOT_DC)
> crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
> crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
> crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
> crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_shutdown ]
> crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active resources before exit
> crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active resources before exit
> crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the LRM
> crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from Heartbeat
> crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
> crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to 26922 is not connected
> crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #========= IPC[outbound] message start ==========#
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
> crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
> crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from internal error
> crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
>
> "ERROR: Cannot write to media pipe 0: Resource temporarily unavailable" looks like it caused all this. What is it? Was the IP stack messed up? And why did heartbeat die here?
The host was simply overwhelmed and couldn't keep up with the communication. The IP stack wasn't the problem: "Resource temporarily unavailable" is EAGAIN, i.e. heartbeat's non-blocking write to the pipe feeding the write process for its first communication medium would have blocked, and at that point the master control process gave up and shut itself down. There's a known bug in 2.1.2 which is exercised when there are many monitor operations running at short intervals. Please upgrade to 2.1.3.

Thanks,

Dejan
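P.S. Until you can upgrade, one possible stopgap is to run the monitor operations less often so the cluster has less work to do. Below is a minimal sketch of what a resource's monitor operation looks like in the 2.1.x CIB XML; the resource, the ids, and the interval/timeout values are made-up examples for illustration, not taken from your configuration:

<primitive id="resource_1" class="ocf" provider="heartbeat" type="IPaddr">
  <operations>
    <!-- a longer interval means fewer monitor operations for the CRM/LRM to schedule -->
    <op id="resource_1_monitor" name="monitor" interval="60s" timeout="30s"/>
  </operations>
</primitive>

You can apply a change like this with cibadmin or through the hb_gui management client.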
