(resending with a proper subject, sorry - I'm using digest mode)

Thanks for the prompt reply. See comments below.
I must stress that I'm under a lot of pressure about this, and it's pretty
critical to solve these issues ASAP. So sorry if I'm a bit panicked :S

On Fri, Sep 5, 2008 at 9:00 PM, <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Thu, Sep 04, 2008 at 04:57:39PM -0400, Itay Donenhirsch wrote:
> > Hi all,
> > I've got a serious crash in heartbeat.
> > The scenario is that I start with 4 stations: ibp1 ibp2 ibp3 and
> > ibp-standby.
> > I get all hosts connected and online.
> > I then shut down the network switch and bring it back up.
> > Then some of the stations (often all of them) keep rebooting with:
> >  heartbeat: [5013]: EMERG: Rebooting system.  Reason: /usr/lib64/heartbeat/crmd
>
> That's a recovery measure.
> crmd is crashing or being killed. Perhaps you should upgrade.
>

I investigated it further and it seems that the heartbeat process kills all
the other processes.
It happens when all the nodes are getting back up, and then I see this
message in the log:
Sep  4 22:56:46 [EMAIL PROTECTED] heartbeat: [3289]: ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
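
For what it's worth, "Resource temporarily unavailable" is the standard text
for EAGAIN, which is what a write to a full non-blocking pipe returns when
nobody is draining the other end. Below is a minimal Python sketch of that
situation. It is not heartbeat code: the function name is made up for the
illustration, and I'm only assuming that heartbeat's media pipe is
non-blocking and that its reader has stalled.

```python
import errno
import os


def fill_nonblocking_pipe():
    """Write to a non-blocking pipe that nobody reads until the write fails.

    Returns (bytes_written, errno_of_failure).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # equivalent to setting O_NONBLOCK
    chunk = b"x" * 4096
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError as exc:
        # The pipe buffer is full and there is no reader draining it.
        return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    written, err = fill_nonblocking_pipe()
    # Prints the byte count, the symbolic errno name and its message text.
    print(written, errno.errorcode[err], os.strerror(err))
```

If that assumption holds, the error would mean the process on the reading
side of the pipe is stuck or too slow, rather than that the pipe itself is
broken.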

That looked very weird to me, and I found
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1697
From the code it looks like this patch wasn't applied in 2.1.4, so I tried to
apply it, but it didn't solve anything. I'm not sure I did that right, as the
code has changed a bit since then.

Another thing I noticed is that /var/heartbeat/pengine fills up (1000s of
files).

About the upgrade - I'm already at 2.1.4 (sorry for not mentioning it
before).


>
> > This keeps going in a loop until I stop heartbeat before it reboots again.
>
> Replace crm yes with crm respawn in ha.cf until you fix it.
>

Won't that just cause the CRM to restart in a loop?
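
Just to check that I'm reading the suggestion right, the change would be a
single directive in ha.cf, with the rest of the file left as it is. This is
only a sketch of that one line:

```
# was: crm yes
crm respawn
```

As far as I understand it, with respawn heartbeat restarts a failed CRM
process instead of rebooting the node, which is why I'm asking about the loop.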


>
> > Please help, I really don't know what to do.
> > Attached is the end of the log file of station ibp3. Some EMERGs are visible
> > there.
> >
> > Thanks,
> > Itay
>
> Thanks,
>
> Dejan
>
>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
