Re: [Linux-HA] Node failure causes peer host to reboot?!?

Dejan Muhamedagic Thu, 17 Apr 2008 05:01:23 -0700

On Thu, Apr 17, 2008 at 01:03:09PM +0200, Andrew Beekhof wrote:
> On Thu, Apr 17, 2008 at 12:58 PM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >
> > On Thu, Apr 17, 2008 at 12:56 PM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >  > On Thu, Apr 17, 2008 at 12:35 PM, Luis Motta Campos
> >  >  <[EMAIL PROTECTED]> wrote:
> >  >  > Dejan Muhamedagic wrote:
> >  >  >  > Hi,
> >  >  >
> >  >  > >> respawn hacluster /usr/lib64/heartbeat/ipfail
> >  >  >  >
> >  >  >  > ipfail doesn't work with crm. You should use pingd instead.
> >  >  >
> >  >  >  Well, I don't think this helps. :( I'm using the suggested
> >  >  >  (reasonable for me) defaults:
> >  >  >
> >  >  >  respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
> >  >  >
> >  >  >  (yes, I'm running CentOS x86_64).
> >  >  >
> >  >  >  I still have problems, but they seem to be worse, now. Before, if I
> >  >  >  restarted heartbeat (/etc/init.d/heartbeat restart), any service
> >  >  >  running on the machine jumped away before the restart, and heartbeat
> >  >  >  was able to restart ok.
> >  >  >
> >  >  >  Using pingd instead of the ipfail, even this is crippled, and
> >  >  >  heartbeat reboots the peer host (the one supposed to keep services
> >  >  >  running) if I try to restart the heartbeat service on one of the
> >  >  >  machines.
> >  >  >
> >  >  >  I presume I'm doing something really stupid, but I can't understand
> >  >  >  it. Please help me out. I used hb_report to fetch all I know about my
> >  >  >  system, please find the report attached.
> >  >  >
> >  >
> >  >  random question - did you install from source or packages?  where did
> >  >  you get them from?
> >  >
> >
> >  and a followup... you cant just make up values for target_role:
> >
> >              <nvpair name="target_role" value="Started:Master"
> >  id="d54bdbb8-5d79-4d12-a95f-9b9b015176e3"/>
> >
> >  makes no sense.  just "Master" would be correct
> >
> 
> Then there is the failed start operation... that wont be helping at all.
> 
> pengine[13743]: 2008/04/17_12:23:22 WARN: unpack_rsc_op: Processing
> failed op database-filesystem_start_0 on db-sql1.ripe.net: Error
> 
> And finally, it looks like there was a crash in the pengine process.
> 
> crmd[12352]: 2008/04/17_12:23:22 WARN: Managed pengine process 13743
> killed by signal 11 [SIGSEGV - Segmentation violation].
> crmd[12352]: 2008/04/17_12:23:22 ERROR: Managed pengine process 13743
> dumped core
> 
> can you have a look for a core file in
> /var/lib/heartbeat/cores/hacluster/ and post the backtrace?


Hmm, again hb_report didn't produce a backtrace. This time
there's not even the header/footer written by echo(1). That's
really strange.

Luis: Did you see any errors while running hb_report?
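
In the meantime, the backtrace Andrew asked for can be pulled by
hand. Below is a minimal sketch, not what hb_report itself does: it
only builds the gdb command lines for whatever core files it finds.
The core directory is the one named above; the pengine binary path
and the "core*" file naming are assumptions based on the
/usr/lib64/heartbeat prefix in your ha.cf, so adjust them to your
install.

```python
import shlex
from pathlib import Path

# Core directory as named in this thread.
CORE_DIR = Path("/var/lib/heartbeat/cores/hacluster")
# ASSUMPTION: pengine lives next to pingd under /usr/lib64/heartbeat.
PENGINE = Path("/usr/lib64/heartbeat/pengine")


def gdb_backtrace_commands(core_dir=CORE_DIR, binary=PENGINE):
    """Return one gdb argv list per core file found in core_dir.

    ASSUMPTION: core files have names starting with "core".
    Returns an empty list if the directory does not exist.
    """
    if not core_dir.is_dir():
        return []
    cores = sorted(
        p for p in core_dir.iterdir()
        if p.is_file() and p.name.startswith("core")
    )
    # "-batch" makes gdb exit after running the commands given with "-ex".
    return [
        ["gdb", "-batch", "-ex", "bt", str(binary), str(core)]
        for core in cores
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed and run by hand as root.
    for argv in gdb_backtrace_commands():
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Run each printed command as root and post the output. If nothing is
printed, then there is no core file in that directory, which would
also explain the empty hb_report section.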

Thanks,

Dejan

> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems