Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Apr 28, 2008 at 08:18:32AM +0200, Johan Hoeke wrote:
>> Hi All,
>>
>> We're running a 2-node heartbeat-2.1.3-3.el4.centos cluster. One of our
>> resources looks like this:
>>
>> crm_resource -x -r R_blackboard_init
>> R_blackboard_init    (lsb:blackboard):       Started julia.uvt.nl
>> raw xml:
>>  <primitive class="lsb" type="blackboard" provider="heartbeat"
>>      id="R_blackboard_init">
>>    <instance_attributes id="R_blackboard_init_instance_attrs">
>>      <attributes/>
>>    </instance_attributes>
>>    <operations>
>>      <op name="start" start_delay="0" disabled="false" role="Started"
>>          id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
>>      <op name="monitor" role="Started" start_delay="20s" timeout="90s"
>>          interval="120s" disabled="false"
>>          id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
>>    </operations>
>>    <instance_attributes id="R_blackboard_init">
>>      <attributes>
>>        <nvpair name="is_managed" id="R_blackboard_init-is_managed"
>>            value="true"/>
>>        <nvpair name="target_role" id="R_blackboard_init-target_role"
>>            value="started"/>
>>      </attributes>
>>    </instance_attributes>
>>    <meta_attributes id="R_blackboard_init_meta_attrs">
>>      <attributes/>
>>    </meta_attributes>
>>  </primitive>
>>
>> The status part of the attached LSB blackboard init script fetches a page
>> with wget:
>>
>> echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
>>
>> in order to check that the blackboard app is still up.
>>
>> Here's my problem:
>>
>> Every so often (the last occurrence was 60 hours after restarting the
>> monitor operation), the monitor operation times out for no apparent
>> reason and the application is restarted.
>>
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor process (PID 19713) timed out (try 1).  Killing with signal SIGTERM (15).
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM - Termination (ANSI)].
>> Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on lsb::blackboard::R_blackboard_init for client 2365, its parameters: is_managed=[true] target_role=[started] CRM_meta_interval=[120000] CRM_meta_role=[Started] CRM_meta_start_delay=[20000] CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0] CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000] crm_feature_set=[2.0] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [19713] timed out
>> Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM operation R_blackboard_init_monitor_120000 (84) Timed Out (timeout=110000ms)
>>
>> To debug this I have added some extra logging. This is the log of a
>> successful monitor operation:
>>
>> Mon Apr 28 08:01:37 CEST 2008 start monitor op
>> Mon Apr 28 08:01:37 CEST 2008 start wget op
>> Mon Apr 28 08:01:37 CEST 2008 end wget op
>> Mon Apr 28 08:01:37 CEST 2008 blackboard is running
>> Mon Apr 28 08:01:37 CEST 2008 end monitor op
>> (end)
>>
>> This is the log of the monitor operation that timed out:
>>
>> Sun Apr 27 00:16:55 CEST 2008 start monitor op
>> Sun Apr 27 00:16:55 CEST 2008 start wget op
>> (end)
>>
>> It looks to me like the monitor operation just dies silently after
>> starting the wget. Then the timeout occurs (110s after the start,
>> according to the log above) and the restart is triggered.
> 
>> Anybody have an idea why this occurs?
> 
> The operation definitely times out, probably because wget blocks
> (what's your $TIMEOUT set to?). The monitor operation didn't

TIMEOUT=18

> disappear and wouldn't without lrmd noticing. All processes are
> managed.

OK, that makes me feel better, I guess ;)
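
For what it's worth, the timestamps fit a plain timeout: the traced monitor
run started at 00:16:55 and lrmd killed it at 00:18:45, and the logged
CRM_meta_timeout of 110000 ms equals the configured 90s timeout plus the 20s
start_delay. That the two are simply added is my reading of the numbers, not
something I have confirmed. A small sketch of the arithmetic (hms_to_s is a
hypothetical helper, not part of the init script):

```shell
#!/bin/sh
# hms_to_s: convert an HH:MM:SS timestamp to seconds since midnight.
# Hypothetical helper, used only for this check.
hms_to_s() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split HH:MM:SS into $1 $2 $3
    IFS=$old_ifs
    # Strip one leading zero so that "08" is not parsed as an octal number.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Timestamps taken from the logs quoted above.
elapsed=$(( $(hms_to_s 00:18:45) - $(hms_to_s 00:16:55) ))
# timeout="90s" plus start_delay="20s" from the monitor op definition.
configured=$(( 90 + 20 ))

echo "elapsed=${elapsed}s configured=${configured}s"
# prints: elapsed=110s configured=110s
```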

The wget goes to localhost and normally responds within the same second; see
the example log above. I don't understand why it sometimes doesn't respond
at all, even though I have a timeout set and I pad grep's input with some
extra text to make sure that grep doesn't block:

URL=http://localhost/webapps/login/
TIMEOUT=18

echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"

That's why I was leaning towards something to do with heartbeat, but your
answer assures me that I need to look elsewhere.

> 
> If you want to debug the script, use set -x to see what's going
> on.

Nod, I'll do that when I'm at work again tomorrow. I'll set up a cron job
with the same test and set -x to reproduce the problem outside of
heartbeat.
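
Roughly what I have in mind for the cron job, as a sketch only: the init
script path and the log file name are assumptions, run_traced is a
hypothetical helper, and running the script through `sh -x` assumes it is a
plain sh script.

```shell
#!/bin/sh
# Assumed locations; adjust to the real init script and a writable log file.
INIT_SCRIPT=${INIT_SCRIPT:-/etc/init.d/blackboard}
LOG=${LOG:-/var/log/blackboard-monitor-trace.log}

# run_traced SCRIPT [ARGS...]
# Runs SCRIPT under "sh -x", appending the trace and all output to $LOG,
# followed by one summary line with the exit code and elapsed seconds.
run_traced() {
    start=$(date +%s)
    sh -x "$@" >>"$LOG" 2>&1
    rc=$?
    end=$(date +%s)
    echo "$(date) rc=$rc elapsed=$((end - start))s cmd=$*" >>"$LOG"
    return "$rc"
}

# A cron wrapper would end with a call such as:
#   run_traced "$INIT_SCRIPT" status
```

If a run hangs, the trace in the log should stop at the command that blocked,
and the summary line will be missing for that run.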

thanks,
regards,

Johan


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
