>>> On 5/11/2010 at 04:36 AM, Dejan Muhamedagic <[email protected]> wrote: 

>> Linux sles10-3 2.6.16.60-0.34-default #1 Fri Jan 16 14:59:01 UTC 2009 i686 i686 i386 GNU/Linux
>> 
>> heartbeat-2.1.4-0.11
> 
> Did you consider upgrading to SLE11?

Sigh. Yes. I'm working on getting there, but first I need to get this 
particular application off the SLES9/Heartbeat1 cluster that it's running on 
now.


>>  <operations>
>>   <op name="monitor" interval="10s" timeout="600s" start_delay="600s" id="317075e9-dabb-4923-9129-be16882f94a4"/>
>>   <op name="start" interval="900s" timeout="600s" start_delay="10s" id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
>>   <op name="stop" interval="10s" timeout="600s" start_delay="10s" id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
>>  </operations>
>> </primitive>
> 
> The start_delay is not needed and may confuse everybody when
> debugging the timing issues. If it is needed then the resource
> agent's start action is broken.

I wasn't originally using start_delay; that was just an attempt to get it to do 
something other than what it's doing now. I'll take it back out again, since it 
isn't doing anything helpful.
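
For reference, this is roughly what the operations block would look like with 
start_delay taken back out. It is only a sketch: the ids, intervals and timeouts 
are carried over unchanged from the block quoted above, and I have not yet 
tested it in this form.

```xml
<!-- Sketch only: same ops and ids as quoted above, with start_delay removed.
     The interval and timeout values are carried over unchanged. -->
<operations>
  <op name="monitor" interval="10s" timeout="600s" id="317075e9-dabb-4923-9129-be16882f94a4"/>
  <op name="start" interval="900s" timeout="600s" id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
  <op name="stop" interval="10s" timeout="600s" id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
</operations>
```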


>> Note the times shown for "(Start" and "(Stop". They are
>> 15:17:35 and 15:17:57. Only 22 seconds have elapsed since
>> Start was called, and now Stop is being called.
>> 
>> My understanding is that the OCF script is working correctly.
>> Its job is to start the resource, and to wait until it can
>> verify that the resource is running before returning to
>> Heartbeat2.
> 
> So far so good.
> 
>> The jboss OCF script is doing this, but Heartbeat2
>> isn't waiting for the script's start command to return. I
>> have not, so far, found any way to influence this. As you can
>> see from the <operations> block in the CIB, I have been
>> cranking up the values to interval, timeout, and start_delay,
>> for the monitor, start and stop operations. None of these
>> changes seems to have any effect on what Heartbeat2 actually
>> does. If Start hasn't returned successful within about 20
>> seconds, Heartbeat2 considers it to have timed out and kills
>> it.
>> 
>> What am I missing here?
> 
> It doesn't look like you're missing anything. If the lrmd
> considers the operation timed out in spite of a different timeout
> specified for the operation, then there seems to be a bug. Though
> I think that timeouts did work properly in Heartbeat 2.1.4. Since
> this is SLES10 and the old version of Heartbeat, you should open
> a support call with Novell.


OK, thanks for the help, Dejan. I wasn't sure if I was going crazy or if this is 
a bug. With confirmation that I'm not going crazy, or at least that this isn't 
evidence that I'm going crazy, I'll put a somewhat simpler test case 
together, check to make sure Novell haven't already fixed this, and open an 
incident with them to get it fixed.




_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
