On Thu, Sep 04, 2008 at 02:52:50PM -0400, Jean-Francois Malouin wrote:
> Hi,
> 
> Hardware: (don't laugh please) 2 Dell 1650s `powered'
> by dual Intel PIIIs at 1.4GHz :)

That's not so bad.

> Softwize: heartbeat-2.1.3 from opensuse.org along with pacemaker-0.6.5-1
> and drbd-8.2.5-0. All this on Debian Etch running 2.6.22.2-i686-smp.
> 
> Whenever I attempt to start a 'drbdadm verify' the load slowly goes
> up until it hits ~4 and then the CRM reports a timeout for an apache
> resource creating havoc:

I think you should turn to the drbd people. The timeout most
probably occurs because the host is too busy to answer the
monitor in time.
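
To see how long the verify had been running before the monitor gave
up, something like the following rough sketch could be run against
the syslog. The sample lines are copied from the excerpt quoted
below; the function names and the year argument are only assumptions
for illustration (syslog timestamps carry no year), and the string
matching is tailored to these two message formats.

```python
import re
from datetime import datetime

# Two lines taken from the log excerpt quoted in this thread.
SAMPLE_LOG = """\
Sep  3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS )
Sep  3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor process (PID 24205) timed out (try 1).  Killing with signal SIGTERM (15).
"""

# Matches the leading "Mon DD HH:MM:SS" syslog timestamp.
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2008):
    """Return the syslog timestamp of a line, or None if it has none."""
    match = STAMP.match(line)
    if match is None:
        return None
    # Collapse the padding in "Sep  3" and add the year syslog omits.
    stamp = " ".join(match.group(1).split())
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def seconds_from_verify_to_timeouts(lines, year=2008):
    """Seconds between the latest verify start and each lrmd timeout."""
    verify_started = None
    delays = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if "-> VerifyS" in line:
            # drbd entered the VerifyS connection state on this node.
            verify_started = stamp
        elif "lrmd" in line and "timed out (try" in line and verify_started:
            delays.append((stamp - verify_started).total_seconds())
    return delays


delays = seconds_from_verify_to_timeouts(SAMPLE_LOG.splitlines())
print(delays)
```

On the two sample lines this gives a gap of a bit over eleven
minutes, which fits the load creeping up slowly rather than an
immediate stall.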

> apache_id   (ocf::heartbeat:apache):  Started feeble-0 (unmanaged) FAILED
> 
> The logs show (that's from yesterday's attempt):
> 
> Sep  3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS ) 
> Sep  3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor process (PID 24205) timed out (try 1).  Killing with signal SIGTERM (15).
> Sep  3 14:59:56 feeble-1 lrmd: [21967]: WARN: operation monitor[307] on ocf::apache::apache_id for client 21970, its parameters: CRM_meta_interval=[60000] CRM_meta_start_delay=[0] CRM_meta_role=[Started] CRM_meta_id=[apache-monitoring] CRM_meta_timeout=[60000] crm_feature_set=[2.1] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [24205] timed out
> Sep  3 14:59:56 feeble-1 tengine: [21975]: info: process_graph_event: Action apache_id_monitor_60000 arrived after a completed transition
> Sep  3 14:59:56 feeble-1 tengine: [21975]: info: update_abort_priority: Abort priority upgraded to 1000000
> Sep  3 14:59:56 feeble-1 tengine: [21975]: WARN: update_failcount: Updating failcount for apache_id on d7fb07f0-a857-446d-98e6-fce91c1b6094 after failed monitor: rc=-2 (update=value++)
> 
> The whole log is online at:
> 
> http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt
> 
> as well as my in-house entire setup:
> 
> http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt
> 
> The ha.cf and cib.xml, along with the drbd config file and its
> status while verifying, are also there.
> 
> After that the group fs->NFS->IP->mysql->apache hangs on the node
> rather than failing over, as the apache resource is reported as
> 'unmanaged'.

That shouldn't be the reason. This looks like a bug. Perhaps you
can upgrade pacemaker to the latest stable release and see how it
behaves. If the same thing happens, please file a bugzilla and
attach a hb_report tarball.

> Sometimes mysql will also stop but not always.
> 
> The only way out I found (suggested on this list) is to manually
> remove the resource from the LRM (a failover then occurs)

Using crm_resource -C?

> but I'd like
> to know where my mistake is: measly hardware that can't cope with the
> load, an HA setup that isn't quite robust enough, or should I increase
> the timeout for apache (60s)?

My guess is that the problem is somewhere in the drbd
configuration or the disk subsystem. For whatever reason,
verifying the disks (drbdadm verify) hogs your hosts.

Thanks,

Dejan


> Any idea?
> Thanks!
> jf
> -- 
> <? ><
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems