Hello,

I have a two-node cluster on RHEL 5.4. I am currently running the heartbeat
service on only one node, because the heartbeat service keeps killing itself
and I want to avoid downtime and split-brain issues. I am running heartbeat
3.0.2-1. I searched and found posts describing similar problems; below are
the same messages I am getting, quoted from another post. Does anyone know
whether this is a known issue, or can point me in the right direction? I'm
stumped.


> On Wed, Aug 18, 2010 at 06:27:02PM +0200, David Mohr wrote:
> >
> > Hi,
> > we were surprised to find our cluster in disarray today: It seems like the
> > heartbeat process died on one of the nodes. These servers are essentially
> > idle since we haven't started using them in production just yet. I tried to
> > google for these errors but to no avail.
> >
> > It is pretty troubling that heartbeat can just die and there is no
> > built-in restart mechanism. Should we build something like that externally?
> > Or what is going on here?
> >
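(Regarding the external restart question quoted above: a minimal external watchdog is one option. The sketch below is only an illustration, not anything from this thread; the `/etc/init.d/heartbeat` path is the RHEL default, and the process name and 30-second interval are assumptions you would want to adjust.)

```shell
#!/bin/sh
# Hypothetical watchdog sketch: restart heartbeat if its master process
# disappears. Run this from init/inittab or a separate supervisor, not
# from within a cluster-managed resource.
while true; do
    # pgrep -x matches the exact process name "heartbeat"
    if ! pgrep -x heartbeat >/dev/null 2>&1; then
        logger -t hb-watchdog "heartbeat not running, attempting restart"
        /etc/init.d/heartbeat start
    fi
    sleep 30
done
```

Note that blindly restarting a daemon that died for an unknown reason can hide real problems, so finding the root cause (see the core-dump discussion below in the thread) is still the priority.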

> > We are using heartbeat 3.0.3, and there was precious little syslog
> > messages:
> >
> > Aug 17 23:26:43 s1a stonithd: [19174]: info: ha_msg_dispatch: Lost
> > connection to heartbeat service.
> > Aug 17 23:26:43 s1a cib: [19172]: info: ha_msg_dispatch: Lost connection
> > to heartbeat service.
> > Aug 17 23:26:43 s1a crmd: [19176]: info: ha_msg_dispatch: Lost connection
> > to heartbeat service.
> > Aug 17 23:26:43 s1a attrd: [19175]: info: ha_msg_dispatch: Lost connection
> > to heartbeat service.
> > Aug 17 23:26:43 s1a cib: [19172]: info: mem_handle_func:IPC broken, ccm is
> > dead before the client!
> > Aug 17 23:26:43 s1a crmd: [19176]: info: mem_handle_func:IPC broken, ccm
> > is dead before the client!
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_state_transition: State
> > transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK
> > origin=ccm_dispatch ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_state_transition: State
> > transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE
> > cause=C_FSA_INTERNAL origin=do_recover ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_shutdown: All subsystems
> > stopped, continuing
> > Aug 17 23:26:43 s1a attrd: [19175]: info: cib_native_msgready: Lost
> > connection to the CIB service [19172].
> > Aug 17 23:26:43 s1a crmd: [19176]: notice: ghash_print_pending_for_rsc:
> > Recurring action pingd_stornet:0:9 (pingd_stornet:0_monitor_10000)
> > incomplete at shutdown
> > Aug 17 23:26:43 s1a crmd: [19176]: notice: ghash_print_pending_for_rsc:
> > Recurring action drbd0:0:11 (drbd0:0_monitor_60000) incomplete at shutdown
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_lrm_control: Disconnected from
> > the LRM
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_ha_control: Disconnected from
> > Heartbeat
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_cib_control: Disconnecting CIB
> > Aug 17 23:26:43 s1a crmd: [19176]: info: crmd_cib_connection_destroy:
> > Connection to the CIB terminated...
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_exit: Performing A_EXIT_0 -
> > gracefully exiting the CRMd
> > Aug 17 23:26:43 s1a crmd: [19176]: info: free_mem: Dropping I_TERMINATE: [
> > state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > Aug 17 23:26:43 s1a crmd: [19176]: info: do_exit: [crmd] stopped (2)
> > Aug 17 23:26:44 s1a pingd: [19281]: info: attrd_update: Could not send
> > update: pingd_stornet=100 for localhost
> > Aug 17 23:26:46 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 4 retries remaining
> > Aug 17 23:26:48 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 3 retries remaining
> > Aug 17 23:26:50 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 2 retries remaining
> > Aug 17 23:26:52 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 1 retries remaining
> > Aug 17 23:26:54 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 5 retries remaining
> > Aug 17 23:26:56 s1a pingd: [19281]: info: attrd_lazy_update: Connecting to
> > cluster... 4 retries remaining
> >
> > The hb_report is available upon request. Unfortunately we had just turned
> > down the debug logging, so we do not have debug output available.
> > One interesting excerpt is:
> > Aug 17 23:26:43 s1a heartbeat: [19161]: CRIT: Emergency Shutdown: Master
> > Control process died.
>
> Looks like the MCP crashed. Do you have core dumps enabled?
>
> Thanks,
>
> Dejan
>
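(For anyone following along: enabling core dumps, as Dejan suggests, would look roughly like the sketch below. The core_pattern path is just an example, and you should double-check the `coredumps` directive against your heartbeat version's ha.cf documentation.)

```shell
# Remove the core file size limit for processes started from this shell
ulimit -c unlimited

# Write cores to a known location, named by executable and pid
# (example path; make sure the directory exists and is writable)
mkdir -p /var/cores
echo '/var/cores/core.%e.%p' > /proc/sys/kernel/core_pattern

# heartbeat's ha.cf also has its own switch for core dumps:
#   coredumps true
# then restart heartbeat for it to take effect.
```

With that in place, the next "Master Control process died" event should leave a core file that can be inspected with gdb to find out why the MCP is crashing.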

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
