Hi,

Hardware: (don't laugh please) two Dell 1650s `powered' by dual Intel PIII 1.4GHz CPUs :)

Software: heartbeat-2.1.3 from opensuse.org along with pacemaker-0.6.5-1 and drbd-8.2.5-0, all on Debian Etch running kernel 2.6.22.2-i686-smp.

Whenever I start a 'drbdadm verify', the load slowly climbs until it hits ~4, and then the CRM reports a timeout for the apache resource, creating havoc:

apache_id (ocf::heartbeat:apache): Started feeble-0 (unmanaged) FAILED

The logs show (this is from yesterday's attempt):

Sep 3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS )
Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor process (PID 24205) timed out (try 1). Killing with signal SIGTERM (15).
Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: operation monitor[307] on ocf::apache::apache_id for client 21970, its parameters: CRM_meta_interval=[60000] CRM_meta_start_delay=[0] CRM_meta_role=[Started] CRM_meta_id=[apache-monitoring] CRM_meta_timeout=[60000] crm_feature_set=[2.1] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [24205] timed out
Sep 3 14:59:56 feeble-1 tengine: [21975]: info: process_graph_event: Action apache_id_monitor_60000 arrived after a completed transition
Sep 3 14:59:56 feeble-1 tengine: [21975]: info: update_abort_priority: Abort priority upgraded to 1000000
Sep 3 14:59:56 feeble-1 tengine: [21975]: WARN: update_failcount: Updating failcount for apache_id on d7fb07f0-a857-446d-98e6-fce91c1b6094 after failed monitor: rc=-2 (update=value++)

The whole log is online at:
http://www.bic.mni.mcgill.ca/~malin/heartbeat/messages-20080903.txt

as well as a write-up of my entire in-house setup:
http://www.bic.mni.mcgill.ca/~malin/heartbeat/howto-heartbeat-drbd.txt

The ha.cf and cib.xml, along with the drbd config file and its status while verifying, are also there.

After that, the group fs->NFS->IP->mysql->apache hangs on the node rather than failing over, because the apache resource is reported as 'unmanaged'. Sometimes mysql will also stop, but not always.

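To put a number on how long the verify runs before the first monitor timeout, here is a minimal Python sketch that works on two lines copied from the excerpt above. It is not part of my setup: the helper names parse_stamp and seconds_to_first_timeout are made up for illustration, and the year 2008 is an assumption, since syslog timestamps omit the year.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt in this post.
LOG = """\
Sep 3 14:48:42 feeble-1 kernel: drbd1: conn( Connected -> VerifyS )
Sep 3 14:59:56 feeble-1 lrmd: [21967]: WARN: apache_id:monitor process (PID 24205) timed out (try 1). Killing with signal SIGTERM (15).
"""

# Leading syslog timestamp, e.g. "Sep 3 14:48:42" (one or more spaces allowed).
STAMP = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2008):
    """Return the timestamp of a syslog line, or None if it has none.

    The year is an assumption passed in by the caller; syslog does not log it.
    """
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def seconds_to_first_timeout(log_text, year=2008):
    """Seconds from the first '-> VerifyS' line to the first lrmd timeout after it."""
    verify_start = None
    for line in log_text.splitlines():
        when = parse_stamp(line, year)
        if when is None:
            continue
        if verify_start is None and "-> VerifyS" in line:
            verify_start = when
        elif verify_start is not None and "lrmd" in line and "timed out" in line:
            return (when - verify_start).total_seconds()
    return None  # no verify start, or no timeout after it


if __name__ == "__main__":
    print(seconds_to_first_timeout(LOG))
```

On those two lines it gives 674 seconds, so the apache monitor first timed out a little over 11 minutes into the verify.
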
The only way out I have found (suggested on this list) is to manually remove the resource from the LRM (a failover then occurs), but I'd like to know where my mistake is: measly hardware that can't cope with the load, an HA setup that isn't robust enough, or should I increase the monitor timeout for apache (currently 60s)?

Any ideas?

Thanks!
jf
-- 
<° ><
