>>> On 5/11/2010 at 04:36 AM, Dejan Muhamedagic <[email protected]> wrote:
>> Linux sles10-3 2.6.16.60-0.34-default #1 Fri Jan 16 14:59:01 UTC 2009 i686 i686 i386 GNU/Linux
>>
>> heartbeat-2.1.4-0.11
>
> Did you consider upgrading to SLE11?

Sigh. Yes. I'm working on getting there, but first I need to get this
particular application off the SLES9/Heartbeat1 cluster that it's
running on now.

>> <operations>
>> <op name="monitor" interval="10s" timeout="600s" start_delay="600s" id="317075e9-dabb-4923-9129-be16882f94a4"/>
>> <op name="start" interval="900s" timeout="600s" start_delay="10s" id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
>> <op name="stop" interval="10s" timeout="600s" start_delay="10s" id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
>> </operations>
>> </primitive>
>
> The start_delay is not needed and may confuse everybody when
> debugging the timing issues. If it is needed then the resource
> agent's start action is broken.

I wasn't originally using start_delay, that was just an attempt to get
it to do something other than what it's doing now. I'll take it back
out again since it isn't doing anything helpful.

>> Note the times shown for "(Start" and "(Stop". They are
>> 15:17:35 and 15:17:57. Only 22 seconds have elapsed since
>> Start was called, and now Stop is being called.
>>
>> My understanding is that the OCF script is working correctly.
>> Its job is to start the resource, and to wait until it can
>> verify that the resource is running before returning to
>> Heartbeat2.
>
> So far so good.
>
>> The jboss OCF script is doing this, but Heartbeat2
>> isn't waiting for the script command to start to return. I
>> have not, so far, found any way to influence this. As you can
>> see from the <operations> block in the CIB, I have been
>> cranking up the values to interval, timeout, and start_delay,
>> for the monitor, start and stop operations. None of these
>> changes seems to have any effect on what Heartbeat2 actually
>> does. If Start hasn't returned successful within about 20
>> seconds, Heartbeat2 considers it to have timed out and kills
>> it.
>>
>> What am I missing here?
>
> It doesn't look like you're missing anything. If the lrmd
> considers the operation timed out in spite of a different timeout
> specified for the operation, then there seems to be a bug. Though
> I think that timeouts did work properly in Heartbeat 2.1.4. Since
> this is SLES10 and the old version of Heartbeat, you should open
> a support call with Novell.

Ok, thanks for the help Dejan. I wasn't sure if I was going crazy or if
this is a bug. With confirmation that I'm not going crazy, or at least
that this isn't evidence that I'm going crazy, I'll get a somewhat
simpler test case put together, check to make sure Novell haven't
already fixed this, and I'll get an incident open with them to get it
fixed.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
