Hi,

Hardware (don't laugh, please): 2 Dell 1650s `powered'
by dual Intel PIIIs at 1.4GHz :)

Software: heartbeat-2.1.3 from opensuse.org along with pacemaker-0.6.5-1
and drbd-8.2.5-0. All this on Debian Etch running 2.6.22.2-i686-smp.

Whenever I start a 'drbdadm verify', the load slowly climbs until it
hits ~4, and then the CRM reports a timeout for an apache resource,
creating havoc:

apache_id   (ocf::heartbeat:apache):  Started feeble-0 (unmanaged) FAILED

The logs show (this is from yesterday's attempt):

Sep  3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS ) 
Sep  3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor process (PID 24205) timed out (try 1).  Killing with signal SIGTERM (15).
Sep  3 14:59:56 feeble-1 lrmd: [21967]: WARN: operation monitor[307] on ocf::apache::apache_id for client 21970, its parameters: CRM_meta_interval=[60000] CRM_meta_start_delay=[0] CRM_meta_role=[Started] CRM_meta_id=[apache-monitoring] CRM_meta_timeout=[60000] crm_feature_set=[2.1] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [24205] timed out
Sep  3 14:59:56 feeble-1 tengine: [21975]: info: process_graph_event: Action apache_id_monitor_60000 arrived after a completed transition
Sep  3 14:59:56 feeble-1 tengine: [21975]: info: update_abort_priority: Abort priority upgraded to 1000000
Sep  3 14:59:56 feeble-1 tengine: [21975]: WARN: update_failcount: Updating failcount for apache_id on d7fb07f0-a857-446d-98e6-fce91c1b6094 after failed monitor: rc=-2 (update=value++)
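
To put the timing in perspective, here is a small sketch that computes how
long the verify had been running before the first monitor timeout, using
only the two timestamps from the excerpt above. The helper names
(parse_syslog_stamp, elapsed_seconds) are my own for illustration and are
not part of heartbeat or drbd.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above. Syslog omits the year,
# so both parse into the same placeholder year; only the difference matters.
VERIFY_START = "Sep  3 14:48:42"   # drbd1: conn( Connected -> VerifyS )
FIRST_TIMEOUT = "Sep  3 14:59:56"  # lrmd: apache_id:monitor timed out


def parse_syslog_stamp(stamp: str) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' syslog timestamp (day may be space-padded)."""
    month, day, clock = stamp.split()  # split() collapses the padding
    return datetime.strptime(f"{month} {day} {clock}", "%b %d %H:%M:%S")


def elapsed_seconds(start: str, end: str) -> int:
    """Whole seconds between two syslog timestamps from the same year."""
    delta = parse_syslog_stamp(end) - parse_syslog_stamp(start)
    return int(delta.total_seconds())


if __name__ == "__main__":
    secs = elapsed_seconds(VERIFY_START, FIRST_TIMEOUT)
    print(f"{secs} s ({secs // 60} min {secs % 60} s)")
```

So the verify had been running for a bit over 11 minutes (674 s) when the
60 s monitor timeout first fired.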

The whole log is online at:

http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt

as well as a write-up of my entire in-house setup:

http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt

The ha.cf and cib.xml, along with the drbd config file and its status
while verifying, are also there.

After that, the group fs->NFS->IP->mysql->apache hangs on the node
rather than failing over, because the apache resource is reported as
'unmanaged'. Sometimes mysql also stops, but not always.

The only way out I have found (suggested on this list) is to manually
remove the resource from the LRM (a failover then occurs), but I'd like
to know where my mistake is: measly hardware that can't cope with the
load, an HA setup that is not quite robust enough, or should I increase
the timeout for apache (60s)?

Any ideas?
Thanks!
jf
-- 
<° ><
