On Thu, Sep 04, 2008 at 02:52:50PM -0400, Jean-Francois Malouin wrote:
> Hi,
>
> Hardware: (don't laugh please) 2 Dell 1650 `powered'
> by duals Intel PIIIs 1.4GHz :)

That's not so bad.

> Softwize: heartbeat-2.1.3 from opensuse.org along with pacemaker-0.6.5-1
> and drbd-8.2.5-0. All this on Debian Etch running 2.6.22.2-i686-smp.
>
> Whenever I attempt to start a 'drbdadm verify' the load slowly goes
> up until it hits ~4 and then the CRM reports a timeout for an apache
> resource creating havoc:

I think that you should turn to the drbd people. The timeout
occurs most probably because the host's busy.

> apache_id (ocf::heartbeat:apache): Started feeble-0 (unmanaged) FAILED
>
> The logs show (that's from yesterday's attempt):
>
> Sep 3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS )
> Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor
> process (PID 24205) timed out (try 1). Killing with signal SIGTERM (15).
> Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: operation monitor[307]
> on ocf::apache::apache_id for client 21970, its parameters:
> CRM_meta_interval=[60000] CRM_meta_start_delay=[0]
> CRM_meta_role=[Started] CRM_meta_id=[apache-monitoring]
> CRM_meta_timeout=[60000] crm_feature_set=[2.1]
> CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid
> [24205] timed out
> Sep 3 14:59:56 feeble-1 tengine: [21975]: info: process_graph_event:
> Action apache_id_monitor_60000 arrived after a completed transition
> Sep 3 14:59:56 feeble-1 tengine: [21975]: info:
> update_abort_priority: Abort priority upgraded to 1000000
> Sep 3 14:59:56 feeble-1 tengine: [21975]: WARN: update_failcount:
> Updating failcount for apache_id on
> d7fb07f0-a857-446d-98e6-fce91c1b6094 after failed monitor: rc=-2
> (update=value++)
>
> The whole log is online at:
>
> http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt
>
> as well as my in-house entire setup:
>
> http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt
>
> The ha.cf and cib.xml along with the drbd config file and status
> while verifying are also there.
>
> After that the group fs->NFS->IP->mysql->apache hangs on the node
> rather than failover as the resource apache is reported as 'unmanaged'.

That shouldn't be the reason. This looks like a bug. Perhaps you
can upgrade pacemaker to the latest stable and see how it
behaves. If the same happens, please file a bugzilla and attach a
hb_report tarball.

> Sometimes mysql will also stop but not always.
>
> The only way out I found (suggested on this list) is to manually
> remove the resource from the LRM (a failover then occurs)

Using crm_resource -C?

> but I'd like
> to know where is my mistake: measly hardware that can't cope with the
> load, my HA setup not quite robust enough or should I increase the
> timeout for apache (60s)?

My guess is that the problem's somewhere in the drbd
configuration/disk system. For whatever reason, verifying the
disks (drbdadm) hogs your hosts.

Thanks,

Dejan

> Any idea?
> Thanks!
> jf
> --
> <? ><
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
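
For reference, the two timestamps in the quoted log give a rough idea of how long the verify had been running before the apache monitor first timed out. Below is a minimal Python sketch of that arithmetic. It assumes the two log lines are copied from the excerpt above with the wrapped tokens rejoined, and it assumes the year 2008 from the message date, since syslog does not record it. The helper names (parse_stamp, lrmd_params) are hypothetical.

```python
import re
from datetime import datetime

# Two lines taken from the quoted log excerpt (wrapped tokens rejoined).
VERIFY_LINE = "Sep 3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS )"
TIMEOUT_LINE = (
    "Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: operation monitor[307] "
    "on ocf::apache::apache_id for client 21970, its parameters: "
    "CRM_meta_interval=[60000] CRM_meta_start_delay=[0] "
    "CRM_meta_role=[Started] CRM_meta_id=[apache-monitoring] "
    "CRM_meta_timeout=[60000] crm_feature_set=[2.1] "
    "CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [24205] timed out"
)

STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s")
PARAM = re.compile(r"(\w+)=\[([^\]]*)\]")


def parse_stamp(line, year=2008):
    """Parse the leading syslog timestamp; the year is an assumption."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("no syslog timestamp at start of line")
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def lrmd_params(line):
    """Collect the name=[value] pairs that lrmd prints for an operation."""
    return dict(PARAM.findall(line))


params = lrmd_params(TIMEOUT_LINE)
elapsed_s = int((parse_stamp(TIMEOUT_LINE) - parse_stamp(VERIFY_LINE)).total_seconds())
timeout_s = int(params["CRM_meta_timeout"]) // 1000
interval_s = int(params["CRM_meta_interval"]) // 1000

print(f"verify ran {elapsed_s} s before the first monitor timeout")
print(f"monitor timeout {timeout_s} s, interval {interval_s} s")
```

On the quoted excerpt this reports 674 s between the VerifyS transition and the first timeout, against a 60 s monitor timeout and a 60 s interval. That fits the description of the load creeping up slowly rather than the monitor failing as soon as the verify starts.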
