Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Apr 28, 2008 at 08:18:32AM +0200, Johan Hoeke wrote:
>> Hi All,
>>
>> We're running a 2 node heartbeat-2.1.3-3.el4.centos cluster. One of our
>> resources look like this:
>>
>> crm_resource -x -r R_blackboard_init
>> R_blackboard_init (lsb:blackboard): Started julia.uvt.nl
>> raw xml:
>> <primitive class="lsb" type="blackboard" provider="heartbeat"
>> id="R_blackboard_init">
>> <instance_attributes id="R_blackboard_init_instance_attrs">
>> <attributes/>
>> </instance_attributes>
>> <operations>
>> <op name="start" start_delay="0" disabled="false" role="Started"
>> id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
>> <op name="monitor" role="Started" start_delay="20s" timeout="90s"
>> interval="120s" disabled="false"
>> id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
>> </operations>
>> <instance_attributes id="R_blackboard_init">
>> <attributes>
>> <nvpair name="is_managed" id="R_blackboard_init-is_managed"
>> value="true"/>
>> <nvpair name="target_role" id="R_blackboard_init-target_role"
>> value="started"/>
>> </attributes>
>> </instance_attributes>
>> <meta_attributes id="R_blackboard_init_meta_attrs">
>> <attributes/>
>> </meta_attributes>
>> </primitive>
>>
>> The status part of the attached lsb blackboard init script actually does
>> something w/ wget:
>>
>> echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard
>> Learning System"
>>
>> in order to check that the blackboard app is still up.
>>
>> Here's my problem:
>>
>> Every so often, last occurrence 60 hours after restarting the monitor
>> operation, the monitor operation will timeout for no apparent reason and
>> restart the application.
>>
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor
>> process (PID 19713) timed out (try 1). Killing with signal SIGTERM (15)
>> .
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed
>> R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM -
>> Termination (ANSI)]
>> .
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on
>> lsb::blackboard::R_blackboard_init for client 2365, its parameters:
>> is_managed=[true] target_role=[started] CRM_meta_interval=[120000]
>> CRM_meta_role=[Started] CRM_meta_start_delay=[20000]
>> CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0]
>> CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000]
>> crm_feature_set=[2.0] CRM_meta_disabled=[false]
>> CRM_meta_name=[monitor] : pid [19713] timed out
>> Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM
>> operation R_blackboard_init_monitor_120000 (84) Timed Out
>> (timeout=110000ms)
>>
>> In order to debug this I have added some extra logs. This is log of a
>> successful monitor operation:
>>
>> Mon Apr 28 08:01:37 CEST 2008 start monitor op
>> Mon Apr 28 08:01:37 CEST 2008 start wget op
>> Mon Apr 28 08:01:37 CEST 2008 end wget op
>> Mon Apr 28 08:01:37 CEST 2008 blackboard is running
>> Mon Apr 28 08:01:37 CEST 2008 end monitor op
>> (end)
>>
>> This is the log of the monitor operation that timed out:
>>
>> Sun Apr 27 00:16:55 CEST 2008 start monitor op
>> Sun Apr 27 00:16:55 CEST 2008 start wget op
>> (end)
>>
>> It looks to me like the monitor operation just dies silently after
>> starting the wget. Then after 120s the timeout occurs and the restart is
>> triggered.
>
>> Anybody have an idea why this occurs?
>
> The operation definitely times out, probably because wget blocks
> (what's your $TIMEOUT set to?). The monitor operation didn't
TIMEOUT=18

> disappear and wouldn't without lrmd noticing. All processes are
> managed.

OK, that makes me feel better, I guess ;)

The wget goes to localhost and normally responds within the same second.
See the example logging above. I don't get why it sometimes doesn't
respond at all, even though I have a timeout set and I'm padding the
input for grep with some extra text to make sure that the grep doesn't
block things:

URL=http://localhost/webapps/login/
TIMEOUT=18
echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"

That's why I was leaning towards something to do with heartbeat, but
your answer assures me that I need to look elsewhere.

>
> If you want to debug the script, use set -x to see what's going
> on.

Nod, I'll do that when I'm at work again tomorrow. I'll set up a cron job
with the same test and set -x to recreate the problem outside of the
heartbeat setting.

Thanks,

regards,

Johan
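
PS: one thing that might be worth ruling out (an assumption, not a confirmed diagnosis for this cluster): wget's -T sets the DNS, connect, and read timeouts individually, and wget retries failed transfers by default (20 tries), so -T 18 on its own does not cap the total run time of the check at 18 seconds. A minimal sketch of the status test with retries switched off follows; the function names fetch_page and blackboard_status are hypothetical, while URL, TIMEOUT, and the marker string are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: fetch_page and blackboard_status are illustrative names,
# not part of the original init script.
URL=${URL:-http://localhost/webapps/login/}
TIMEOUT=${TIMEOUT:-18}
MARKER="Blackboard Learning System"

# Fetch the login page exactly once.
# -t 1 disables wget's default retries, so a stalled server is not
# re-queried again and again; -T applies to DNS, connect, and read.
fetch_page() {
    wget -t 1 -T "$TIMEOUT" -q -O - "$URL"
}

# Exit status 0 only if the marker text appears in the fetched page.
# grep -q on empty input simply returns non-zero, so no padding is needed.
blackboard_status() {
    fetch_page 2>/dev/null | grep -q "$MARKER"
}
```

In the init script's status branch this would be called as `blackboard_status && echo "blackboard is running"`. Even with -t 1 the read timeout is an idle timeout, so a server that trickles data slowly could still run past it.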
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
