Hi,

We have a two-node setup running heartbeat version 2.0.8-1. On one node, heartbeat exited with an "Emergency Shutdown" message and was restarted. After the restart, heartbeat on the other node exited for roughly the same reason. Can someone please help us identify the issue? Are these known bugs, and if so, have they been fixed in later releases?
Any help would be greatly appreciated.

The nodes configuration:

sh-3.00# uname -a
Linux S-FL2-PLS-NAC 2.6.17-1.2142_FC4smp #1 SMP Sat Aug 12 08:16:08 EDT 2006 i686 i686 i386 GNU/Linux

Following are the logs from the first node:

Mar 3 14:47:05 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Message hist queue is filling up (197 messages in queue)
Mar 3 14:47:05 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Message hist queue is filling up (198 messages in queue)
Mar 3 14:47:06 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Message hist queue is filling up (199 messages in queue)
Mar 3 14:47:06 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 3 14:47:10 S-FL2-PLS-NAC last message repeated 7 times
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 7 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =207, lowseq=7,ackseq=0,lastmsg=6
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 7 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =207, lowseq=7,ackseq=0,lastmsg=6
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 8 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =208, lowseq=8,ackseq=0,lastmsg=7
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 8 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =208, lowseq=8,ackseq=0,lastmsg=7
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 8 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =208, lowseq=8,ackseq=0,lastmsg=7
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: Cannot rexmit pkt 8 for s-fl2-sls-nac.yardi.com: seqno too low
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: fromnode = s-fl2-sls-nac.yardi.com, fromnode's ackseq = 0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist information:
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hiseq =208, lowseq=8,ackseq=0,lastmsg=7
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: ERROR: lowseq cannnot be greater than ackseq
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist->ackseq =10, old_ackseq=0
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: hist->lowseq =201, hist->hiseq=208, send_cluster_msg_level=0
Mar 3 14:47:10 S-FL2-PLS-NAC ccm: [5284]: ERROR: Lost connection to heartbeat service. Need to bail out.
Mar 3 14:47:10 S-FL2-PLS-NAC cib: [5285]: ERROR: cib_ha_connection_destroy: Heartbeat connection lost! Exiting.
Mar 3 14:47:10 S-FL2-PLS-NAC stonithd: [5287]: ERROR: Disconnected with heartbeat daemon
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: CRIT: crmd_ha_msg_dispatch: Lost connection to heartbeat service.
Mar 3 14:47:10 S-FL2-PLS-NAC mgmtd: [5290]: ERROR: Lost connection to heartbeat service.
Mar 3 14:47:10 S-FL2-PLS-NAC stonithd: [5287]: notice: /usr/lib/heartbeat/stonithd normally quit.
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: mem_handle_func:IPC broken, ccm is dead before the client!
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: CRIT: attrd_ha_dispatch: Lost connection to heartbeat service.
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: ERROR: ccm_dispatch: CCM connection appears to have failed: rc=-1.
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: ERROR: do_log: [[FSA]] Input I_ERROR from ccm_dispatch() received in state (S_PENDING)
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_state_transition: s-fl2-pls-nac.yardi.com: State transition S_PENDING -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
Mar 3 14:47:10 S-FL2-PLS-NAC cib: [5285]: info: uninitializeCib: The CIB has been deallocated.
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_state_transition: s-fl2-pls-nac.yardi.com: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_dc_release: DC role released
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_state_transition: s-fl2-pls-nac.yardi.com: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_shutdown ]
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: ERROR: cib_native_msgready: Message pending on command channel [5285]
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: verify_stopped: Checking for active resources before exit
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: ERROR: crm_log_message_adv: #========= cib:cmd message start ==========#
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: verify_stopped: Checking for active resources before exit
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: ERROR: MSG: No message to dump
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_lrm_control: Disconnected from the LRM
Mar 3 14:47:10 S-FL2-PLS-NAC mgmtd: [5290]: ERROR: cib_native_msgready: Message pending on command channel [5285]
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: info: cib_native_msgready: Lost connection to the CIB service [5285].
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_ha_control: Disconnected from Heartbeat
Mar 3 14:47:10 S-FL2-PLS-NAC mgmtd: [5290]: ERROR: crm_log_message_adv: #========= cib:cmd message start ==========#
Mar 3 14:47:10 S-FL2-PLS-NAC attrd: [5288]: CRIT: cib_native_dispatch: Lost connection to the CIB service [5285/callback].
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_cib_control: Disconnecting CIB
Mar 3 14:47:10 S-FL2-PLS-NAC mgmtd: [5290]: ERROR: MSG: No message to dump
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Mar 3 14:47:10 S-FL2-PLS-NAC mgmtd: [5290]: CRIT: cib_native_dispatch: Lost connection to the CIB service [5285/callback].
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: ERROR: do_exit: Could not recover from internal error
Mar 3 14:47:10 S-FL2-PLS-NAC crmd: [5289]: info: do_exit: [crmd] stopped (2)
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Emergency Shutdown: Master Control process died.
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5057 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5062 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5063 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5064 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5065 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5066 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5067 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5068 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Killing pid 5069 with SIGTERM
Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5061]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
Mar 3 16:00:12 S-FL2-PLS-NAC auditd[2341]: Audit daemon rotating log files
Mar 3 19:06:54 S-FL2-PLS-NAC auditd[2341]: Audit daemon rotating log files

Logs from the second node are:

Mar 14 19:38:13 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23220 ms
Mar 14 19:38:36 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23160 ms
Mar 14 19:38:59 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23220 ms
Mar 14 19:39:22 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23180 ms
Mar 14 19:39:45 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23200 ms
Mar 14 19:40:08 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23150 ms
Mar 14 19:40:32 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23200 ms
<lots of these messages>
Mar 14 19:41:18 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23250 ms
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: WARN: Late heartbeat: Node s-fl2-sls-nac.yardi.com: interval 23580 ms
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Heartbeat restart on node s-fl2-pls-nac.yardi.com
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Link s-fl2-pls-nac.yardi.com:eth3 up.
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Status update for node s-fl2-pls-nac.yardi.com: status init
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Link s-fl2-pls-nac.yardi.com:eth1 up.
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Status update for node s-fl2-pls-nac.yardi.com: status up
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: all clients are now paused
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:42 S-FL2-SLS-NAC last message repeated 2 times
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: info: Status update for node s-fl2-pls-nac.yardi.com: status active
Mar 14 19:41:42 S-FL2-SLS-NAC cib: [2492]: info: cib_client_status_callback: Status update: Client s-fl2-pls-nac.yardi.com/cib now has status [join]
Mar 14 19:41:42 S-FL2-SLS-NAC heartbeat: [2411]: WARN: 1 lost packet(s) for [s-fl2-pls-nac.yardi.com] [42:44]
Mar 14 19:41:42 S-FL2-SLS-NAC crmd: [2496]: notice: crmd_ha_status_callback: Status update: Node s-fl2-pls-nac.yardi.com now has status [init]
Mar 14 19:41:42 S-FL2-SLS-NAC crmd: [2496]: info: crmd_ha_status_callback: Ping node s-fl2-pls-nac.yardi.com is init
Mar 14 19:41:42 S-FL2-SLS-NAC crmd: [2496]: notice: crmd_ha_status_callback: Status update: Node s-fl2-pls-nac.yardi.com now has status [up]
Mar 14 19:41:42 S-FL2-SLS-NAC crmd: [2496]: info: crmd_ha_status_callback: Ping node s-fl2-pls-nac.yardi.com is up
Mar 14 19:41:42 S-FL2-SLS-NAC crmd: [2496]: notice: crmd_ha_status_callback: Status update: Node s-fl2-pls-nac.yardi.com now has status [active]
Mar 14 19:41:42 S-FL2-SLS-NAC cib: [2492]: info: cib_diff_notify: Local-only Change (client:2496, call: 175): 0.26.612 (ok)
Mar 14 19:41:42 S-FL2-SLS-NAC tengine: [10419]: info: te_update_diff: Processing diff (cib_update): 0.26.612 -> 0.26.612
Mar 14 19:41:42 S-FL2-SLS-NAC cib: [2991]: info: write_cib_contents: Wrote version 0.26.612 of the CIB to disk (digest: e9e9c5aebf16b1faf617dca58907fc8c)
Mar 14 19:41:43 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:43 S-FL2-SLS-NAC heartbeat: [2411]: info: No pkts missing from s-fl2-pls-nac.yardi.com!
Mar 14 19:41:43 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:43 S-FL2-SLS-NAC heartbeat: [2411]: WARN: 1 lost packet(s) for [s-fl2-pls-nac.yardi.com] [47:49]
Mar 14 19:41:43 S-FL2-SLS-NAC crmd: [2496]: notice: crmd_client_status_callback: Status update: Client s-fl2-pls-nac.yardi.com/crmd now has status [online]
Mar 14 19:41:43 S-FL2-SLS-NAC crmd: [2496]: info: crmd_client_status_callback: Uncaching UUID for s-fl2-pls-nac.yardi.com
Mar 14 19:41:44 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:44 S-FL2-SLS-NAC heartbeat: [2411]: info: No pkts missing from s-fl2-pls-nac.yardi.com!
Mar 14 19:41:44 S-FL2-SLS-NAC heartbeat: [2411]: ERROR: Message hist queue is filling up (200 messages in queue)
Mar 14 19:41:44 S-FL2-SLS-NAC heartbeat: [2411]: info: all clients are now resumed
Mar 14 19:41:44 S-FL2-SLS-NAC cib: [2492]: info: cib_process_readwrite: We are now in R/O mode
Mar 14 19:41:44 S-FL2-SLS-NAC cib: [2492]: WARN: cib_process_diff: Diff 0.26.600 -> 0.26.601 not applied to 0.26.612: current "num_updates" is greater than required
Mar 14 19:41:44 S-FL2-SLS-NAC cib: [2492]: WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed
Mar 14 19:41:44 S-FL2-SLS-NAC cib: [2492]: WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed
Mar 14 19:41:44 S-FL2-SLS-NAC cib: [2492]: WARN: cib_process_replace: Replacement 0.26.601 not applied to 0.26.612: current num_updates is greater than the replacement
<lots of these messages>
Mar 14 19:41:47 S-FL2-SLS-NAC ccm: [2491]: ERROR: Lost connection to heartbeat service. Need to bail out.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: get_uuid: get_uuid_by_name() call failed for host s-fl2-pls-nac.yardi.com
Mar 14 19:41:47 S-FL2-SLS-NAC cib: [2492]: ERROR: cib_ha_connection_destroy: Heartbeat connection lost! Exiting.
Mar 14 19:41:47 S-FL2-SLS-NAC cib: [2492]: info: uninitializeCib: The CIB has been deallocated.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_state_transition: s-fl2-sls-nac.yardi.com: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE origin=route_message ]
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: info: update_abort_priority: Abort priority upgraded to 1000000
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: update_dc: Set DC to <null> (<null>)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: send_ha_message: Not connected to Heartbeat
Mar 14 19:41:47 S-FL2-SLS-NAC stonithd: [2494]: ERROR: Disconnected with heartbeat daemon
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: crm_log_message_adv: #========= HA[outbound] message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: ERROR: cib_native_msgready: Message pending on command channel [2492]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG: Dumping message with 10 fields
Mar 14 19:41:47 S-FL2-SLS-NAC mgmtd: [2497]: ERROR: cib_native_msgready: Message pending on command channel [2492]
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: ERROR: crm_log_message_adv: #========= cib:cmd message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC attrd: [2495]: CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[0] : [origin=join_make_offer]
Mar 14 19:41:47 S-FL2-SLS-NAC mgmtd: [2497]: ERROR: crm_log_message_adv: #========= cib:cmd message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: ERROR: MSG: No message to dump
Mar 14 19:41:47 S-FL2-SLS-NAC attrd: [2495]: ERROR: cib_native_msgready: Message pending on command channel [2492]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[1] : [t=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC mgmtd: [2497]: ERROR: MSG: No message to dump
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: info: cib_native_msgready: Lost connection to the CIB service [2492].
Mar 14 19:41:47 S-FL2-SLS-NAC attrd: [2495]: ERROR: crm_log_message_adv: #========= cib:cmd message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[2] : [version=1.0.7]
Mar 14 19:41:47 S-FL2-SLS-NAC attrd: [2495]: ERROR: MSG: No message to dump
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[3] : [subt=request]
Mar 14 19:41:47 S-FL2-SLS-NAC attrd: [2495]: info: cib_native_msgready: Lost connection to the CIB service [2492].
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[4] : [reference=join_offer-dc-1205503907-113]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[5] : [crm_task=join_offer]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[6] : [crm_sys_to=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[7] : [crm_sys_from=dc]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[8] : [crm_host_to= s-fl2-pls-nac.yardi.com]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[9] : [join_id=8]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: send_msg_via_ha: Sending directed HA message (ref=join_offer-dc-1205503907-113) to [EMAIL PROTECTED] failed.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: send_ha_message: Not connected to Heartbeat
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: crm_log_message_adv: #========= HA[outbound] message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG: Dumping message with 10 fields
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[0] : [origin=join_make_offer]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[1] : [t=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[2] : [version=1.0.7]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[3] : [subt=request]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[4] : [reference=join_offer-dc-1205503907-114]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[5] : [crm_task=join_offer]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[6] : [crm_sys_to=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[7] : [crm_sys_from=dc]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[8] : [crm_host_to= s-fl2-sls-nac.yardi.com]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[9] : [join_id=8]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: send_msg_via_ha: Sending directed HA message (ref=join_offer-dc-1205503907-114) to [EMAIL PROTECTED] failed.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_dc_join_offer_all: join-8: Waiting on 2 outstanding join acks
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_election_count_vote: Election check: vote from s-fl2-pls-nac.yardi.com
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_election_count_vote: Election won over s-fl2-pls-nac.yardi.com
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_state_transition: s-fl2-sls-nac.yardi.com: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: update_dc: Set DC to <null> (<null>)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: send_ha_message: Not connected to Heartbeat
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: crm_log_message_adv: #========= HA[outbound] message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG: Dumping message with 10 fields
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[0] : [origin=do_election_vote]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[1] : [t=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[2] : [version=1.0.7]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[3] : [subt=request]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[4] : [reference=vote-crmd-1205503907-115]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[5] : [crm_task=vote]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[6] : [crm_sys_to=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[7] : [crm_sys_from=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[8] : [election-owner=a5ea4881-0e06-4ea3-83a9-1d0f2184109d]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[9] : [election-id=4]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: send_msg_via_ha: Sending broadcast HA message (ref=vote-crmd-1205503907-115) to crmd@<all> failed.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: populate_cib_nodes: Requesting the list of configured nodes
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: get_uuid: get_uuid_by_name() call failed for host s-fl2-pls-nac.yardi.com
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: crm_abort: add_node_copy: Triggered non-fatal assert at xml.c:281 : src_node != NULL
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: send_ha_message: Not connected to Heartbeat
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: crm_log_message_adv: #========= HA[outbound] message start ==========#
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG: Dumping message with 10 fields
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[0] : [origin=do_election_vote]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[1] : [t=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[2] : [version=1.0.7]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[3] : [subt=request]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[4] : [reference=vote-crmd-1205503907-116]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[5] : [crm_task=vote]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[6] : [crm_sys_to=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[7] : [crm_sys_from=crmd]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[8] : [election-owner=a5ea4881-0e06-4ea3-83a9-1d0f2184109d]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: MSG[9] : [election-id=5]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: send_msg_via_ha: Sending broadcast HA message (ref=vote-crmd-1205503907-116) to crmd@<all> failed.
Mar 14 19:41:47 S-FL2-SLS-NAC mgmtd: [2497]: ERROR: Lost connection to heartbeat service.
Mar 14 19:41:47 S-FL2-SLS-NAC stonithd: [2494]: notice: /usr/lib/heartbeat/stonithd normally quit.
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: ERROR: stonithd_op_result_ready: failed due to not on signon status.
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: ERROR: tengine_stonith_connection_destroy: Fencing daemon has left us
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: do_log: [[FSA]] Input I_ERROR from crmd_cib_connection_destroy() received in state (S_ELECTION)
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: info: update_abort_priority: Abort action 2 superceeded by 3
Mar 14 19:41:47 S-FL2-SLS-NAC pengine: [10420]: info: pengine_shutdown: Exiting PEngine (SIGTERM)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_state_transition: s-fl2-sls-nac.yardi.com: State transition S_ELECTION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Mar 14 19:41:47 S-FL2-SLS-NAC tengine: [10419]: info: notify_crmd: Exiting after transition
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_dc_release: DC role released
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to pengine: [10420]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to tengine: [10419]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_state_transition: s-fl2-sls-nac.yardi.com: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_dc_release: DC role released
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to pengine: [10420]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to tengine: [10419]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the pengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to pengine: [10420]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the tengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to tengine: [10419]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Waiting for subsystems to exit
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the pengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to pengine: [10420]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the tengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to tengine: [10419]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Waiting for subsystems to exit
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: mem_handle_func:IPC broken, ccm is dead before the client!
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the pengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to pengine: [10420]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Terminating the tengine
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: stop_subsystem: Sent -TERM to tengine: [10419]
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_shutdown: Waiting for subsystems to exit
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: verify_stopped: Checking for active resources before exit
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: 10 pending LRM operations at shutdown
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Event-Gateway:25
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Policy-Manager:41
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Event-Correlation:39
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Check-Drives:13
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: IPaddr_corp:19
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Master-Database:21
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: IPaddr_mgmt:15
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: IPaddr_log:17
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Events-Database:23
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: ghash_print_pending: Pending action: Admin-Notify:43
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Events-Database was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Event-Gateway was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Policy-Manager was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource IPaddr_corp was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource IPaddr_mgmt was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource IPaddr_log was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Master-Database was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Event-Correlation was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Check-Drives was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: verify_stopped: Resource Admin-Notify was active at shutdown. You may ignore this error if it is unmanaged.
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: do_exit: Performing A_EXIT_1 - forcefully exiting the CRMd
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: ERROR: do_exit: Could not recover from internal error
Mar 14 19:41:47 S-FL2-SLS-NAC crmd: [2496]: info: do_exit: [crmd] stopped (2)
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Emergency Shutdown: Master Control process died.
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2411 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2443 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2444 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2445 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2446 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2447 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2448 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2449 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Killing pid 2450 with SIGTERM
Mar 14 19:41:48 S-FL2-SLS-NAC heartbeat: [2442]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
Mar 14 23:02:12 S-FL2-SLS-NAC auditd[2292]: Audit daemon rotating log files
Mar 15 07:22:19 S-FL2-SLS-NAC auditd[2292]: Audit daemon rotating log files
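
In case it helps anyone reading the first node's logs: the "hist information" lines can be pulled out with a few lines of Python to see how far hiseq has run ahead of lowseq. This is only a rough sketch for inspecting the log text. The names HIST_RE and hist_windows are made up for illustration and are not part of heartbeat, and "depth" here is simply hiseq minus lowseq, which we are assuming corresponds to the queue size reported in the "Message hist queue is filling up" messages.

```python
import re

# Matches the detail line as it appears in the logs above, e.g.
# "info: hiseq =207, lowseq=7,ackseq=0,lastmsg=6".
# HIST_RE and hist_windows are illustrative names, not heartbeat APIs.
HIST_RE = re.compile(
    r"hiseq\s*=\s*(\d+),\s*lowseq\s*=\s*(\d+),"
    r"\s*ackseq\s*=\s*(\d+),\s*lastmsg\s*=\s*(\d+)"
)


def hist_windows(log_text):
    """Yield (hiseq, lowseq, ackseq, lastmsg, depth) per hist line found.

    depth is hiseq - lowseq; treating it as the queue size is an assumption.
    """
    for match in HIST_RE.finditer(log_text):
        hiseq, lowseq, ackseq, lastmsg = (int(g) for g in match.groups())
        yield hiseq, lowseq, ackseq, lastmsg, hiseq - lowseq


# Two lines copied from the first node's log.
sample = (
    "Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: "
    "hiseq =207, lowseq=7,ackseq=0,lastmsg=6\n"
    "Mar 3 14:47:10 S-FL2-PLS-NAC heartbeat: [5057]: info: "
    "hiseq =208, lowseq=8,ackseq=0,lastmsg=7\n"
)

windows = list(hist_windows(sample))
for window in windows:
    print(window)
```

On both sample lines the difference between hiseq and lowseq is 200, the same figure the "filling up" errors report, while ackseq stays at 0.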
