On Fri, Sep 05, 2008 at 09:58:58AM -0400, Jean-Francois Malouin wrote:
> * Dejan Muhamedagic <[EMAIL PROTECTED]> [20080905 06:58]:
> > On Thu, Sep 04, 2008 at 02:52:50PM -0400, Jean-Francois Malouin wrote:
[...]
> > > apache_id (ocf::heartbeat:apache): Started feeble-0 (unmanaged) FAILED
> > >
> > > The whole log is online at:
> > >
> > > http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt
> > >
> > > as well as my entire in-house setup:
> > >
> > > http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt
> > >
> > > The ha.cf and cib.xml along with the drbd config file and status
> > > while verifying are also there.
> > >
> > > After that the group fs->NFS->IP->mysql->apache hangs on the node
> > > rather than failing over, as the resource apache is reported as
> > > 'unmanaged'.
> >
> > That shouldn't be the reason. This looks like a bug. Perhaps you
> > can upgrade pacemaker to the latest stable and see how it
> > behaves. If the same happens, please file a bugzilla and attach a
> > hb_report tarball.
>
> It's my intent to upgrade both heartbeat and pacemaker but how
> do you do this on a live cluster? Put one node standby,
> upgrade, put it back online and do the same for the other node?

If you want to keep your resources running without failover, try
this:

http://www.linux-ha.org/TransparentUpgrade

> What happens when there are nodes not exactly at the same revision
> level on a live cluster?

The cluster should be able to handle that.

> > > Sometimes mysql will also stop but not always.
> > >
> > > The only way out I found (suggested on this list) is to manually
> > > remove the resource from the LRM (a failover then occurs)
> >
> > Using crm_resource -C?
>
> yep.
> I now realize after some thought that it might be the apache RA
> that's not quite OCF-compliant... I will test it and report back.

Hmm, if it's the ocf/heartbeat/apache then it definitely should be.

> > > but I'd like
> > > to know where my mistake is: measly hardware that can't cope with the
> > > load, my HA setup not quite robust enough or should I increase the
> > > timeout for apache (60s)?
> >
> > My guess is that the problem's somewhere in the drbd
> > configuration/disk system. For whatever reason, verifying the
> > disks (drbdadm) hogs your hosts.
> >
> > Thanks,
> >
> > Dejan
>
> Thanks for the tips,
> jf

Welcome.

Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
