Re: [Linux-HA] problem with drbd verify and heartbeat?

Jean-Francois Malouin Fri, 05 Sep 2008 06:59:51 -0700

* Dejan Muhamedagic <[EMAIL PROTECTED]> [20080905 06:58]:
> On Thu, Sep 04, 2008 at 02:52:50PM -0400, Jean-Francois Malouin wrote:
> > Hi,
> > 
> > Hardware: (don't laugh please) 2 Dell 1650s `powered'
> > by dual Intel PIII 1.4GHz CPUs :)
> 
> That's not so bad.
> 
> > Software: heartbeat-2.1.3 from opensuse.org along with pacemaker-0.6.5-1
> > and drbd-8.2.5-0. All this on Debian Etch running 2.6.22.2-i686-smp.
> > 
> > Whenever I attempt to start a 'drbdadm verify' the load slowly goes
> > up until it hits ~4 and then the CRM reports a timeout for an apache
> > resource creating havoc:
> 
> I think that you should turn to the drbd people. The timeout
> occurs most probably because the host's busy.


I will post on the drbd list once I have eliminated a few variables...
See below.

[snip]

> > apache_id   (ocf::heartbeat:apache):  Started feeble-0 (unmanaged) FAILED
> > 
> > The whole log is online at:
> > 
> > http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt
> > 
> > as well as my entire in-house setup:
> > 
> > http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt
> > 
> > The ha.cf and cib.xml along with the drbd config file and status
> > while verifying are also there.
> > 
> > After that the group fs->NFS->IP->mysql->apache hangs on the node
> > rather than failing over, as the apache resource is reported as 'unmanaged'.
> 
> That shouldn't be the reason. This looks like a bug. Perhaps you
> can upgrade pacemaker to the latest stable and see how it
> behaves. If the same happens, please file a bugzilla and attach a
> hb_report tarball.

It's my intent to upgrade both heartbeat and pacemaker, but how
do you do this on a live cluster? Put one node in standby, upgrade it,
put it back online and do the same for the other node?
What happens when the nodes of a live cluster are not at exactly the
same revision level?

> 
> > Sometimes mysql will also stop but not always.
> > 
> > The only way out I found (suggested on this list) is to manually
> > remove the resource from the LRM (a failover then occurs)
> 
> Using crm_resource -C?

yep.
I now realize after some thought that it might be the apache RA
that's not quite OCF-compliant... I will test it and report back.
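
For what it's worth, the kind of check I have in mind is roughly the
sketch below: call the agent by hand with the OCF environment set and
compare the exit code of each action against what the OCF spec expects
(monitor should return 0 when the resource is running and 7 when it is
cleanly stopped). The agent path, the OCF root and the `configfile`
parameter value are assumptions about my install, and the helper names
are made up for illustration; this is not a replacement for a proper
test tool.

```python
import os
import subprocess

# Exit codes defined by the OCF resource agent API.
OCF_EXIT_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}


def describe_exit_code(rc):
    """Return the symbolic OCF name for an exit code."""
    return OCF_EXIT_CODES.get(rc, "UNKNOWN(%d)" % rc)


def build_env(params, ocf_root="/usr/lib/ocf"):
    """Build the environment an agent expects: OCF_ROOT plus one
    OCF_RESKEY_<name> variable per resource parameter.
    The default ocf_root is an assumption about the install."""
    env = dict(os.environ)
    env["OCF_ROOT"] = ocf_root
    for name, value in params.items():
        env["OCF_RESKEY_" + name] = str(value)
    return env


def run_action(agent_path, action, params, timeout=60):
    """Run one agent action (start, stop, monitor, ...) and return
    its exit code. Raises subprocess.TimeoutExpired on timeout."""
    proc = subprocess.run([agent_path, action],
                          env=build_env(params), timeout=timeout)
    return proc.returncode


def monitor_result_ok(rc, expect_running):
    """monitor must return 0 if running and 7 if cleanly stopped."""
    return rc == (0 if expect_running else 7)


if __name__ == "__main__":
    # Hypothetical paths and parameter value; adjust to the local setup.
    agent = "/usr/lib/ocf/resource.d/heartbeat/apache"
    params = {"configfile": "/etc/apache2/apache2.conf"}
    rc = run_action(agent, "monitor", params)
    print("monitor ->", rc, describe_exit_code(rc))
```

Running monitor once with apache stopped and once with it started, and
feeding the result to monitor_result_ok(), should show quickly whether
the agent reports a stopped server as 7 or as a generic error.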

> 
> > but I'd like
> > to know where my mistake is: measly hardware that can't cope with the
> > load, an HA setup that's not quite robust enough, or should I increase
> > the timeout for apache (60s)?
> 
> My guess is that the problem's somewhere in the drbd
> configuration/disk system. For whatever reason, verifying the
> disks (drbdadm) hogs your hosts.
> 
> Thanks,
> 
> Dejan

Thanks for the tips,
jf

> 
> 
> > Any idea?
> > Thanks!
> > jf
> > -- 
> > <° ><
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems

-- 
<° ><
