Hello, I have a 2-node cluster on RHEL 5.4. I'm currently running the heartbeat service on only one node, because heartbeat keeps killing itself and I'm trying to avoid downtime and split-brain issues. I've searched around and found posts describing similar problems. I'm running heartbeat 3.0.2-1. Below is a thread (quoted from a different post) with the same messages I'm getting. Does anyone know if this is a known issue, or can anyone point me in the right direction? I'm stumped.
> On Wed, Aug 18, 2010 at 06:27:02PM +0200, David Mohr wrote:
> > Hi,
> > we were surprised to find our cluster in disarray today: It seems like the
> > heartbeat process died on one of the nodes. These servers are essentially
> > idle since we haven't started using them in production just yet. I tried to
> > google for these errors but to no avail.
> >
> > It is pretty troubling that heartbeat can just die and there is no
> > built-in restart mechanism. Should we build something like that externally?
> > Or what is going on here?
> >
> > We are using heartbeat 3.0.3, and there was precious little syslog
> > messages:
> >
> > Aug 17 23:26:43 s1a stonithd: [19174]: info: ha_msg_dispatch: Lost connection to heartbeat service.
> > Aug 17 23:26:43 s1a cib: [19172]: info: ha_msg_dispatch: Lost connection to heartbeat service.
> > Aug 17 23:26:43 s1a crmd: [19176]: info: ha_msg_dispatch: Lost connection to heartbeat service.
> > Aug 17 23:26:43 s1a attrd: [19175]: info: ha_msg_dispatch: Lost connection to heartbeat service.
> > Aug 17 23:26:43 s1a cib: [19172]: info: mem_handle_func:IPC broken, ccm is dead before the client!
> > Aug 17 23:26:43 s1a crmd: [19176]: info: mem_handle_func:IPC broken, ccm is dead before the client!
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_shutdown: All subsystems stopped, continuing
> > Aug 17 23:26:43 s1a attrd: [19175]: info: cib_native_msgready: Lost connection to the CIB service [19172].
> > Aug 17 23:26:43 s1a crmd: [19176]: notice: ghash_print_pending_for_rsc: Recurring action pingd_stornet:0:9 (pingd_stornet:0_monitor_10000) incomplete at shutdown
> > Aug 17 23:26:43 s1a crmd: [19176]: notice: ghash_print_pending_for_rsc: Recurring action drbd0:0:11 (drbd0:0_monitor_60000) incomplete at shutdown
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_lrm_control: Disconnected from the LRM
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_ha_control: Disconnected from Heartbeat
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_cib_control: Disconnecting CIB
> > Aug 17 23:26:43 s1a crmd: [19176]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > Aug 17 23:26:43 s1a crmd: [19176]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_exit: [crmd] stopped (2)
> > Aug 17 23:26:44 s1a pingd: [19281]: info: attrd_update: Could not send update: pingd_stornet=100 for localhost
> > Aug 17 23:26:46 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 4 retries remaining
> > Aug 17 23:26:48 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 3 retries remaining
> > Aug 17 23:26:50 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 2 retries remaining
> > Aug 17 23:26:52 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 1 retries remaining
> > Aug 17 23:26:54 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
> > Aug 17 23:26:56 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to cluster... 4 retries remaining
> >
> > The hb_report is available upon request. Unfortunately we had just turned
> > down the debug logging, so we do not have debug output available.
> >
> > One interesting excerpt is:
> > Aug 17 23:26:43 s1a heartbeat: [19161]: CRIT: Emergency Shutdown: Master Control process died.
>
> Looks like the MCP crashed. Do you have core dumps enabled?
>
> Thanks,
> Dejan
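In case it helps anyone searching later: the best interim answer I've come up with for the "no built-in restart mechanism" question in that thread is an external watchdog run from cron. This is only a sketch under my own assumptions (stock RHEL 5 SysV init script at /etc/init.d/heartbeat; the hb-watchdog name and path are mine), and blindly restarting heartbeat on a node that was fenced or lost quorum could make a split brain worse, so use with caution:

    #!/bin/sh
    # hb-watchdog (hypothetical helper): run from cron every minute.
    # If the heartbeat init script reports the service as stopped,
    # log the event and try to start it again.
    if ! /etc/init.d/heartbeat status >/dev/null 2>&1; then
        logger -t hb-watchdog "heartbeat not running, attempting restart"
        /etc/init.d/heartbeat start
    fi

    # Corresponding /etc/crontab entry (adjust the path to taste):
    # * * * * * root /usr/local/sbin/hb-watchdog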
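And on Dejan's question about core dumps, this is the minimal sequence I believe enables them on RHEL 5 (ulimit and /proc/sys/kernel/core_pattern are standard kernel facilities; the restart is needed so the daemon inherits the new limit):

    # Lift the core file size limit for processes started from this shell,
    # then restart heartbeat so the master control process inherits it.
    ulimit -c unlimited
    /etc/init.d/heartbeat restart

    # Optional, kernel-wide: name core files after the program and PID
    # so the MCP's core is easy to identify.
    echo "core.%e.%p" > /proc/sys/kernel/core_pattern

If I'm reading the docs right, heartbeat also has a "coredumps true" directive for ha.cf, which would be the cleaner way to get a core out of the MCP.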
