Hi,

On Fri, May 07, 2010 at 11:17:47AM -0500, David Gersic wrote:
> I'm not entirely new to Heartbeat2, but I've run into
> something here that I have not been able to figure out. What
> I'm trying to do is create a JBoss resource, as part of a
> resource group (disk, ip, mysql, jboss), for an application. I
> have the disk, ip, and MySQL resources working, it's just the
> JBoss resource that's proving to be more difficult than
> expected. This particular JBoss application takes a while to
> get fully started, which is where I think I'm running into
> trouble. More on that below.
> 
> 
> The servers are SLES10, and I'm using Heartbeat2:
> 
> Linux sles10-3 2.6.16.60-0.34-default #1 Fri Jan 16 14:59:01 UTC 2009 i686 
> i686 i386 GNU/Linux
> 
> heartbeat-2.1.4-0.11

Did you consider upgrading to SLE11?

> I've defined my JBoss application resource:
> 
> <primitive class="ocf" type="jboss" provider="heartbeat" is_managed="true" 
> id="JBoss_4">
>  <instance_attributes id="JBoss_4_instance_attrs">
>   <attributes>
>    <nvpair name="resource_name" value="IDMProv" 
> id="4609063b-c767-4956-a2f9-f44f46b634a9"/>
>    <nvpair name="console" value="/shared/uadisk/rbpm37/jboss.log" 
> id="a78e869b-2b00-474c-987e-5919c1ce80e7"/>
>    <nvpair name="shutdown_timeout" value="60" 
> id="16812ae8-4f3c-46df-bdbe-24f7e0fcd557"/>
>    <nvpair name="user" value="rbpm" 
> id="f8ccf1ed-c701-4dc0-b8e5-d66c779a8b9f"/>
>    <nvpair name="statusurl" value="http://131.156.12.4:8080/IDMProv" 
> id="11d32111-4e2b-4224-99d2-20af4eb43eb8"/>
>    <nvpair name="java_home" value="/usr/java/jre1.6.0_18/" 
> id="2881e29c-4b87-4094-bffc-d0d9e9682e16"/>
>    <nvpair name="jboss_home" value="/shared/uadisk/rbpm37/jboss" 
> id="d21e26f2-3e2f-4a98-9ecd-8f934b844434"/>
>    <nvpair name="run_opts" value="-c IDMProv -b 0.0.0.0" 
> id="681efa45-d7f3-4ef0-b90d-5d652b6480d6"/>
>    <nvpair name="shutdown_opts" value="-S" 
> id="f10343d3-28cc-4497-b234-e4678dda5818"/>
>   </attributes>
>  </instance_attributes>
>  <operations>
>   <op name="monitor" interval="10s" timeout="600s" start_delay="600s" 
> id="317075e9-dabb-4923-9129-be16882f94a4"/>
>   <op name="start" interval="900s" timeout="600s" start_delay="10s" 
> id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
>   <op name="stop" interval="10s" timeout="600s" start_delay="10s" 
> id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
>  </operations>
> </primitive>

The start_delay settings are not needed and only make timing
issues harder to debug. If a delay really is needed, then the
resource agent's start action is broken.
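
As a rough sketch of what I mean (not tested on 2.1.4), the
operations block could be reduced to something like the one
below. It keeps the ids and timeouts from your CIB and only drops
start_delay. Leaving interval off start and stop is my
assumption: the interval should only matter for recurring
operations such as monitor.

```xml
<operations>
  <!-- recurring monitor: interval and timeout kept from the posted CIB -->
  <op name="monitor" interval="10s" timeout="600s"
      id="317075e9-dabb-4923-9129-be16882f94a4"/>
  <!-- start and stop: no start_delay; interval omitted (assumption: one-shot operations) -->
  <op name="start" timeout="600s"
      id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
  <op name="stop" timeout="600s"
      id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
</operations>
```

If start still gets killed after about 20 seconds with a block
like this, that would point at the timeout not being honoured
rather than at the delays.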

> This has gone through many iterations over the last few days.
> This is what's currently in the CIB.
> 
> 
> This particular Heartbeat2 version didn't include a JBoss OCF
> script, but I obtained this one from the list archives:

Best to get the script from the hg.linux-ha.org/agents
repository.

[...]

> I've modified it slightly from what was posted to add some
> debugging to figure out what's going on. The "echo" statements
> added are all un-indented to make them easy to spot. The script
> functionality is unchanged.
> 
> When I change this resource group from Stopped to Started, I
> see the Disk, IP, and MySQL resource change to Started. I then
> see the JBoss resource change to Started. Then, a few seconds
> later, it changes to Stopped.
> 
> Starting this JBoss application by hand, it takes a while to
> get to the point where it's fully deployed and running. "A
> while" in this case is anywhere from 3 to 10 minutes. During
> that time, JBoss itself can be seen as running, but the attempt
> in the OCF script to verify that the application is working
> using wget will fail, since the provided URL isn't yet
> available.

Does the start action wait until the resource is operational? It
should.

> It looks like Heartbeat2 is not waiting long enough for the
> application to start. By adding the debugging "echo" statements
> to the jboss OCF script, I can see that first the (start) is
> called. When (start_jboss) is called, it builds and fires off
> the command line to start JBoss. Then it enters a loop where it
> calls (monitor_jboss). The first couple of times
> (monitor_jboss) is called, JBoss isn't running yet, so
> (monitor_jboss) returns and the loop continues. After a few
> times through this, (monitor_jboss) sees that JBoss is running
> and starts calling (isrunning_jboss). (isrunning_jboss) uses
> wget to see if the application is running, which it isn't yet
> since it's only been a few seconds and this application takes
> at least 3 minutes to get going. This loop then repeats a few
> more times. We're still in (start_jboss), calling
> (monitor_jboss), which calls (isrunning_jboss).
> 
> This is where things go wrong. After about 20 seconds,
> Heartbeat2 calls (stop). This then goes off and kills the JBoss
> application that is still in the process of starting. After
> which, the resource is marked as "Stopped" in the CIB.
> 
> You can see this in the log written out by the "echo" commands
> in the jboss OCF script:
> 
> ..................................................................................................................
> JBoss   Thu May 6 15:17:35 CDT 2010
> OCF_ROOT is /usr/lib/ocf
> OCF_RESKEY_resource_name is IDMProv
> OCF_RESKEY_console is /shared/uadisk/rbpm37/jboss.log
> OCF_RESKEY_kill_timeout is
> OCF_RESKEY_user is rbpm
> OCF_RESKEY_statusurl is http://131.156.12.4:8080/IDMProv
> OCF_RESKEY_java_home is /usr/java/jre1.6.0_18/
> OCF_RESKEY_jboss_home is /shared/uadisk/rbpm37/jboss
> OCF_RESKEY_pstring is 
> OCF_RESKEY_run_opts is -c IDMProv -b 0.0.0.0
> OCF_RESKEY_shutdown_opts is -S
> pwd is /var/lib/heartbeat/cores/root
> /JBoss
> (Start Thu May  6 15:17:35 CDT 2010)
> (Monitor Thu May  6 15:17:35 CDT 2010)
> su - -s /bin/bash rbpm -c export JAVA_HOME=/usr/java/jre1.6.0_18/; export 
> JBOSS_HOME=/shared/uadisk/rbpm37/jboss; 
> /shared/uadisk/rbpm37/jboss/bin/run.sh -c IDMProv -b 0.0.0.0
>   JBOSS_USER is rbpm
>   JAVA_HOME is /usr/java/jre1.6.0_18/
>   JBOSS_HOME is /shared/uadisk/rbpm37/jboss
>   RUN_OPTS is -c IDMProv -b 0.0.0.0
>   pwd is /var/lib/heartbeat/cores/root
>   1. Thu May  6 15:17:35 CDT 2010
>   2. Thu May  6 15:17:35 CDT 2010
>   3. Thu May  6 15:17:35 CDT 2010
> (Monitor Thu May  6 15:17:35 CDT 2010)
>   3. Thu May  6 15:17:38 CDT 2010
> (Monitor Thu May  6 15:17:38 CDT 2010)
> (Is Running Thu May  6 15:17:38 CDT 2010)
>   3. Thu May  6 15:17:42 CDT 2010
> (Monitor Thu May  6 15:17:42 CDT 2010)
> (Is Running Thu May  6 15:17:42 CDT 2010)
>   3. Thu May  6 15:17:45 CDT 2010
> (Monitor Thu May  6 15:17:45 CDT 2010)
> (Is Running Thu May  6 15:17:45 CDT 2010)
>   3. Thu May  6 15:17:48 CDT 2010
> (Monitor Thu May  6 15:17:48 CDT 2010)
> (Is Running Thu May  6 15:17:48 CDT 2010)
>   3. Thu May  6 15:17:51 CDT 2010
> (Monitor Thu May  6 15:17:51 CDT 2010)
> (Is Running Thu May  6 15:17:51 CDT 2010)
>   3. Thu May  6 15:17:54 CDT 2010
> (Monitor Thu May  6 15:17:54 CDT 2010)
> (Is Running Thu May  6 15:17:54 CDT 2010)
> ..................................................................................................................
> JBoss   Thu May 6 15:17:56 CDT 2010
> OCF_ROOT is /usr/lib/ocf
> OCF_RESKEY_resource_name is IDMProv
> OCF_RESKEY_console is /shared/uadisk/rbpm37/jboss.log
> OCF_RESKEY_kill_timeout is
> OCF_RESKEY_user is rbpm
> OCF_RESKEY_statusurl is http://131.156.12.4:8080/IDMProv
> OCF_RESKEY_java_home is /usr/java/jre1.6.0_18/
> OCF_RESKEY_jboss_home is /shared/uadisk/rbpm37/jboss
> OCF_RESKEY_pstring is 
> OCF_RESKEY_run_opts is -c IDMProv -b 0.0.0.0
> OCF_RESKEY_shutdown_opts is -S
> pwd is /var/lib/heartbeat/cores/root
> /JBoss
> (Stop Thu May  6 15:17:57 CDT 2010)
> 
> 
> Note the times shown for "(Start" and "(Stop". They are
> 15:17:35 and 15:17:57. Only 22 seconds have elapsed since
> Start was called, and now Stop is being called.
> 
> My understanding is that the OCF script is working correctly.
> Its job is to start the resource, and to wait until it can
> verify that the resource is running before returning to
> Heartbeat2.

So far so good.

> The jboss OCF script is doing this, but Heartbeat2
> isn't waiting for the script's start action to return. I
> have not, so far, found any way to influence this. As you can
> see from the <operations> block in the CIB, I have been
> cranking up the values to interval, timeout, and start_delay,
> for the monitor, start and stop operations. None of these
> changes seems to have any effect on what Heartbeat2 actually
> does. If Start hasn't returned successful within about 20
> seconds, Heartbeat2 considers it to have timed out and kills
> it.
> 
> What am I missing here?

It doesn't look like you're missing anything. If the lrmd
considers the operation timed out in spite of a different timeout
specified for the operation, then there seems to be a bug, though
I think that timeouts did work properly in Heartbeat 2.1.4. Since
this is SLES10 and an old version of Heartbeat, you should open
a support call with Novell.

Thanks,

Dejan

> 
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems