Hi,

On Fri, May 07, 2010 at 11:17:47AM -0500, David Gersic wrote:
> I'm not entirely new to Heartbeat2, but I've run into
> something here that I have not been able to figure out. What
> I'm trying to do is create a JBoss resource, as part of a
> resource group (disk, ip, mysql, jboss), for an application. I
> have the disk, ip, and MySQL resources working, it's just the
> JBoss resource that's proving to be more difficult than
> expected. This particular JBoss application takes a while to
> get fully started, which is where I think I'm running into
> trouble. More on that below.
>
>
> The servers are SLES10, and I'm using Heartbeat2:
>
> Linux sles10-3 2.6.16.60-0.34-default #1 Fri Jan 16 14:59:01 UTC 2009 i686
> i686 i386 GNU/Linux
>
> heartbeat-2.1.4-0.11

Did you consider upgrading to SLE11?

> I've defined my JBoss application resource:
>
> <primitive class="ocf" type="jboss" provider="heartbeat" is_managed="true"
>     id="JBoss_4">
>   <instance_attributes id="JBoss_4_instance_attrs">
>     <attributes>
>       <nvpair name="resource_name" value="IDMProv"
>           id="4609063b-c767-4956-a2f9-f44f46b634a9"/>
>       <nvpair name="console" value="/shared/uadisk/rbpm37/jboss.log"
>           id="a78e869b-2b00-474c-987e-5919c1ce80e7"/>
>       <nvpair name="shutdown_timeout" value="60"
>           id="16812ae8-4f3c-46df-bdbe-24f7e0fcd557"/>
>       <nvpair name="user" value="rbpm"
>           id="f8ccf1ed-c701-4dc0-b8e5-d66c779a8b9f"/>
>       <nvpair name="statusurl" value="http://131.156.12.4:8080/IDMProv"
>           id="11d32111-4e2b-4224-99d2-20af4eb43eb8"/>
>       <nvpair name="java_home" value="/usr/java/jre1.6.0_18/"
>           id="2881e29c-4b87-4094-bffc-d0d9e9682e16"/>
>       <nvpair name="jboss_home" value="/shared/uadisk/rbpm37/jboss"
>           id="d21e26f2-3e2f-4a98-9ecd-8f934b844434"/>
>       <nvpair name="run_opts" value="-c IDMProv -b 0.0.0.0"
>           id="681efa45-d7f3-4ef0-b90d-5d652b6480d6"/>
>       <nvpair name="shutdown_opts" value="-S"
>           id="f10343d3-28cc-4497-b234-e4678dda5818"/>
>     </attributes>
>   </instance_attributes>
>   <operations>
>     <op name="monitor" interval="10s" timeout="600s" start_delay="600s"
>         id="317075e9-dabb-4923-9129-be16882f94a4"/>
>     <op name="start" interval="900s" timeout="600s" start_delay="10s"
>         id="ab407ca5-78e4-48e9-bee0-f70f64d011e4"/>
>     <op name="stop" interval="10s" timeout="600s" start_delay="10s"
>         id="33d5a72d-105c-4da0-99cb-b25de520a5ae"/>
>   </operations>
> </primitive>

The start_delay is not needed and may confuse everybody when
debugging the timing issues. If it is needed then the resource
agent's start action is broken.

> This has gone through many iterations over the last few days.
> This is what's currently in the CIB.
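
To illustrate why start_delay should not be needed: a start action
is expected to block until the resource really answers. Below is a
minimal sketch of such a polling loop. The function name
wait_until_up and its arguments are made up for this example; it is
not the code from the jboss agent in the list archives, and it
assumes the probe is any shell command that exits 0 once the
application is up.

```shell
#!/bin/sh
# Sketch only: wait_until_up and its parameters are hypothetical names,
# not taken from the jboss resource agent.
#
# wait_until_up PROBE_CMD INTERVAL MAX_TRIES
#   Runs PROBE_CMD up to MAX_TRIES times, sleeping INTERVAL seconds
#   between failed attempts.
#   Returns 0 as soon as the probe succeeds, 1 if every attempt failed.
wait_until_up() {
    probe=$1
    interval=$2
    tries=$3

    while [ "$tries" -gt 0 ]; do
        # The probe runs in a child shell; its output is discarded.
        if sh -c "$probe" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # Do not sleep after the last failed attempt.
        if [ "$tries" -gt 0 ]; then
            sleep "$interval"
        fi
    done
    return 1
}
```

In a start action this might be called right after run.sh has been
launched in the background, for example as
wait_until_up "wget -q -O /dev/null $OCF_RESKEY_statusurl" 3 200
(the interval and the number of tries are arbitrary here). With the
waiting done inside start, the start timeout is the only value which
has to cover the 3 to 10 minutes the application needs.
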
>
>
> This particular Heartbeat2 version didn't include a JBoss OCF
> script, but I obtained this one from the list archives:

Best to get the script from the hg.linux-ha.org/agents
repository.

[...]

> I've modified it slightly from what was posted to add some
> debugging to figure out what's going on. The "echo" statements
> added are all un-indented to make them easy to spot. The script
> functionality is unchanged.
>
> When I change this resource group from Stopped to Started, I
> see the Disk, IP, and MySQL resource change to Started. I then
> see the JBoss resource change to Started. Then, a few seconds
> later, it changes to Stopped.
>
> Starting this JBoss application by hand, it takes a while to
> get to the point where it's fully deployed and running. "A
> while" in this case is anywhere from 3 to 10 minutes. During
> that time, JBoss itself can be seen as running, but the attempt
> in the OCF script to verify that the application is working
> using wget will fail, since the provided URL isn't yet
> available.

Does the start action wait until the resource is operational? It
should.

> It looks like Heartbeat2 is not waiting long enough for the
> application to start. By adding the debugging "echo" statements
> to the jboss OCF script, I can see that first the (start) is
> called. When (start_jboss) is called, it builds and fires off
> the command line to start JBoss. Then it enters a loop where it
> calls (monitor_jboss). The first couple of times
> (monitor_jboss) is called, JBoss isn't running yet, so
> (monitor_jboss) returns and the loop continues. After a few
> times through this, (monitor_jboss) sees that JBoss is running
> and starts calling (isrunning_jboss). (isrunning_jboss) uses
> wget to see if the application is running, which it isn't yet
> since it's only been a few seconds and this application takes
> at least 3 minutes to get going. This loop then repeats a few
> more times.
> We're still in (start_jboss), calling
> (monitor_jboss), which calls (isrunning_jboss).
>
> This is where things go wrong. After about 20 seconds,
> Heartbeat2 calls (stop). This then goes off and kills the JBoss
> application that is still in the process of starting. After
> which, the resource is marked as "Stopped" in the CIB.
>
> You can see this in the log written out by the "echo" commands
> in the jboss OCF script:
>
> ..................................................................................................................
> JBoss Thu May 6 15:17:35 CDT 2010
> OCF_ROOT is /usr/lib/ocf
> OCF_RESKEY_resource_name is IDMProv
> OCF_RESKEY_console is /shared/uadisk/rbpm37/jboss.log
> OCF_RESKEY_kill_timeout is
> OCF_RESKEY_user is rbpm
> OCF_RESKEY_statusurl is http://131.156.12.4:8080/IDMProv
> OCF_RESKEY_java_home is /usr/java/jre1.6.0_18/
> OCF_RESKEY_jboss_home is /shared/uadisk/rbpm37/jboss
> OCF_RESKEY_pstring is
> OCF_RESKEY_run_opts is -c IDMProv -b 0.0.0.0
> OCF_RESKEY_shutdown_opts is -S
> pwd is /var/lib/heartbeat/cores/root
> /JBoss
> (Start Thu May 6 15:17:35 CDT 2010)
> (Monitor Thu May 6 15:17:35 CDT 2010)
> su - -s /bin/bash rbpm -c export JAVA_HOME=/usr/java/jre1.6.0_18/; export
> JBOSS_HOME=/shared/uadisk/rbpm37/jboss;
> /shared/uadisk/rbpm37/jboss/bin/run.sh -c IDMProv -b 0.0.0.0
> JBOSS_USER is rbpm
> JAVA_HOME is /usr/java/jre1.6.0_18/
> JBOSS_HOME is /shared/uadisk/rbpm37/jboss
> RUN_OPTS is -c IDMProv -b 0.0.0.0
> pwd is /var/lib/heartbeat/cores/root
> 1. Thu May 6 15:17:35 CDT 2010
> 2. Thu May 6 15:17:35 CDT 2010
> 3. Thu May 6 15:17:35 CDT 2010
> (Monitor Thu May 6 15:17:35 CDT 2010)
> 3. Thu May 6 15:17:38 CDT 2010
> (Monitor Thu May 6 15:17:38 CDT 2010)
> (Is Running Thu May 6 15:17:38 CDT 2010)
> 3. Thu May 6 15:17:42 CDT 2010
> (Monitor Thu May 6 15:17:42 CDT 2010)
> (Is Running Thu May 6 15:17:42 CDT 2010)
> 3. Thu May 6 15:17:45 CDT 2010
> (Monitor Thu May 6 15:17:45 CDT 2010)
> (Is Running Thu May 6 15:17:45 CDT 2010)
> 3. Thu May 6 15:17:48 CDT 2010
> (Monitor Thu May 6 15:17:48 CDT 2010)
> (Is Running Thu May 6 15:17:48 CDT 2010)
> 3. Thu May 6 15:17:51 CDT 2010
> (Monitor Thu May 6 15:17:51 CDT 2010)
> (Is Running Thu May 6 15:17:51 CDT 2010)
> 3. Thu May 6 15:17:54 CDT 2010
> (Monitor Thu May 6 15:17:54 CDT 2010)
> (Is Running Thu May 6 15:17:54 CDT 2010)
> ..................................................................................................................
> JBoss Thu May 6 15:17:56 CDT 2010
> OCF_ROOT is /usr/lib/ocf
> OCF_RESKEY_resource_name is IDMProv
> OCF_RESKEY_console is /shared/uadisk/rbpm37/jboss.log
> OCF_RESKEY_kill_timeout is
> OCF_RESKEY_user is rbpm
> OCF_RESKEY_statusurl is http://131.156.12.4:8080/IDMProv
> OCF_RESKEY_java_home is /usr/java/jre1.6.0_18/
> OCF_RESKEY_jboss_home is /shared/uadisk/rbpm37/jboss
> OCF_RESKEY_pstring is
> OCF_RESKEY_run_opts is -c IDMProv -b 0.0.0.0
> OCF_RESKEY_shutdown_opts is -S
> pwd is /var/lib/heartbeat/cores/root
> /JBoss
> (Stop Thu May 6 15:17:57 CDT 2010)
>
>
> Note the times shown for "(Start" and "(Stop". They are
> 15:17:35 and 15:17:57. Only 22 seconds have elapsed since
> Start was called, and now Stop is being called.
>
> My understanding is that the OCF script is working correctly.
> Its job is to start the resource, and to wait until it can
> verify that the resource is running before returning to
> Heartbeat2.

So far so good.

> The jboss OCF script is doing this, but Heartbeat2
> isn't waiting for the script command to start to return. I
> have not, so far, found any way to influence this. As you can
> see from the <operations> block in the CIB, I have been
> cranking up the values to interval, timeout, and start_delay,
> for the monitor, start and stop operations. None of these
> changes seems to have any effect on what Heartbeat2 actually
> does.
> If Start hasn't returned successfully within about 20
> seconds, Heartbeat2 considers it to have timed out and kills
> it.
>
> What am I missing here?

It doesn't look like you're missing anything. If the lrmd
considers the operation timed out in spite of a different timeout
specified for the operation, then there seems to be a bug. Though
I think that timeouts did work properly in Heartbeat 2.1.4.

Since this is SLES10 and the old version of Heartbeat, you should
open a support call with Novell.

Thanks,

Dejan

>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
