Re: [Linux-HA] Heartbeat Reboot - Why?

Dejan Muhamedagic Fri, 02 Jan 2009 01:26:26 -0800

Hi,

On Wed, Dec 31, 2008 at 11:03:19AM -0800, Mike Sweetser - Adhost wrote:
> > The reason for reboot was that the crmd encountered an
> > unrecoverable condition and exited. It is not clear what happened
> > to crmd. It could be a communication problem, though there's
> > nothing in the logs from the lower layer (heartbeat). BTW, you
> > can prevent reboots by replacing "crm yes" with "crm respawn" in
> > ha.cf, though that probably won't help.
> 
> Can I do this without stopping/starting Heartbeat?


There's some support for rereading the configuration, but I'd
rather restart heartbeat. Anyway, your nodes should be able to
restore themselves to a good state after reboot.
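
If you do want to try the "crm respawn" setting mentioned above, the
edit is a one-line change. A rough sketch only: the function name is
made up for illustration, and /etc/ha.d/ha.cf is assumed to be where
your ha.cf lives (check your installation).

```shell
# Hypothetical helper: change "crm yes" to "crm respawn" in ha.cf.
# A backup copy (<file>.bak) is made first and used as the sed input,
# so the original file keeps its permissions and can be restored.
set_crm_respawn() {
    cf="${1:-/etc/ha.d/ha.cf}"   # assumed default location
    cp "$cf" "$cf.bak" &&
        sed 's/^crm[[:space:]][[:space:]]*yes[[:space:]]*$/crm respawn/' "$cf.bak" > "$cf"
}
```

Run it as root (set_crm_respawn, or set_crm_respawn /path/to/ha.cf)
and then restart heartbeat, typically with the init script shipped
by your distribution.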

> > A reboot shouldn't leave your system in an unstable state. What
> > do you mean by "unstable"? Why was there no failover?
> 
> That is an excellent question, and one that I'd really like to have the
> answer to.  It was almost as if the secondary node did not detect the
> outage.
> 
> > Can you
> > please produce a hb_report report. That would include the
> > configuration and logs from both nodes and all other relevant
> > information.
> 
> I've attached the hb_report - let me know if there's another way I
> should do this.

This is what I found:

1. There's a duplicate entry for node f1-uda01 in the status
section. No idea how that happened; perhaps earlier logs would
reveal it. The CRM probably can't cope with that, so it must be
removed:

cibadmin -D -o nodes -X '<node id="4fedd91c-5fcd-491a-b1cc-6b6734525002" 
uname="f1-uda01.adhost.com" type="normal"/>'

If cibadmin refuses to do that because there's no quorum, try
adding the --force option.

That should hopefully fix the DC election which is failing right
now.
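
To see which names are duplicated before deleting anything, you can
filter the cibadmin query output. A minimal sketch, assuming the
entries are <node ...> elements in the nodes section and
<node_state ...> elements in the status section, each with a uname
attribute; the function name is hypothetical.

```shell
# Hypothetical helper: read CIB XML on stdin and print every uname
# that occurs more than once among elements named $1
# (for example "node" or "node_state").
dup_unames() {
    grep -o "<$1 [^>]*>" |
        sed -n 's/.*uname="\([^"]*\)".*/\1/p' |
        sort | uniq -d
}
```

For example: cibadmin -Q -o nodes | dup_unames node, or
cibadmin -Q -o status | dup_unames node_state. No output means no
duplicates were found.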

2. Three drbddisk resources (s0-drbd, r1-drbd, r2-drbd) failed to
start at some point in the past on f1-uda01 with exit code 20. I
don't know what that code means. Try to find out from earlier logs.
There should've been, hopefully, some output from drbdadm or
such. Perhaps ask on the linbit mailing list since drbddisk is
part of the drbd package.

You should check why these resources failed and fix them _before_
trying to fix the cluster. If your resources are failing, the
cluster probably won't be of much use.

3. crmadmin on f2-uda01 failed to connect to crmd (or timed out).
Also, there's no cib.xml in /var/lib/heartbeat/crm. Perhaps
permissions are wrong. After checking permissions, you should
restart heartbeat.
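
A quick way to look at the ownership of the CIB directory, as a
sketch. The function name is hypothetical, and the expected value
is an assumption: on a standard heartbeat installation the
directory normally belongs to user hacluster and group haclient.

```shell
# Hypothetical helper: print "owner:group" of the CIB directory,
# or of the directory given as the first argument.
crm_dir_owner() {
    ls -ld "${1:-/var/lib/heartbeat/crm}" | awk '{print $3 ":" $4}'
}
```

If crm_dir_owner prints something other than hacluster:haclient,
that is worth fixing before the restart.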

4. The heartbeat/CRM version is good, but how did you install
heartbeat? There's no rpm package.

> By the way, hb_report would not properly detect the log until I
> commented out some of the detection code in
> /usr/share/heartbeat/utillib.sh:
> 
> 123,125c123,125
> <       #if [ "$HA_SYSLOGMSGFMT" -o "$HA_LOGFACILITY" ]; then
> <       #       awk '{print $1,$2,$3}'
> <       #else
> ---
> >       if [ "$HA_SYSLOGMSGFMT" -o "$HA_LOGFACILITY" ]; then
> >               awk '{print $1,$2,$3}'
> >       else
> 127c127
> <       #fi
> ---
> >       fi

Your ha.cf has a few too many logging options specified:

logfacility     local0
debugfile /var/log/ha-debug
logfile /var/log/ha-log

hb_report prefers syslog. You should remove either the
logfacility or the two *file statements.
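
To check what is currently set, something like the following would
do. A sketch only: the function name is made up, and
/etc/ha.d/ha.cf is assumed as the default location.

```shell
# Hypothetical check: list the logging directives present in ha.cf,
# one per line, in the order they appear in the file.
log_directives() {
    grep -E '^[[:space:]]*(logfacility|logfile|debugfile)[[:space:]]' \
        "${1:-/etc/ha.d/ha.cf}" | awk '{print $1}'
}
```

If it prints logfacility together with logfile or debugfile, both
kinds of logging are configured and one of them should go.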

Oh, and it would be good to upgrade to 2.1.4. 2.1.3 has some
known issues which may get exercised, depending on your
configuration.

Thanks,

Dejan

> Thank You,
> Mike Sweetser
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
