Hi,

On Fri, Jul 09, 2010 at 04:01:14PM +0100, Brett Delle Grazie wrote:
> Hi,
> 
> Yes I meant parent of the process is init.
> 
> Timeout for start operation is: 120 seconds - process is still
> around (so is tomcat) after this.

According to these log messages:

> Jul 09 15:39:47 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:184: start
> Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_start_0 (call=184, rc=0, cib-update=208, 
> confirmed=true) ok
> Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:186: stop
> Jul 09 15:47:49 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_stop_0 (call=186, rc=0, cib-update=210, 
> confirmed=true) ok
> Jul 09 15:47:50 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing 
> key=23:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_start_0 )
> Jul 09 15:47:50 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:187: start
> Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_start_0 (call=187, rc=0, cib-update=211, 
> confirmed=true) ok

everything seems to run fine. The start operation takes around 4
seconds. It seems to exit just fine. I really don't know what's
happening. Can you run this again with debug turned on. Then
produce a hb_report. If it's too big to send to the list, you can
open a bugzilla.

> heartbeat-libs - taken from linbit repo packages where we have
> a support contract as we use DRBD in other stuff - 

OK. Perhaps you can upgrade from linbit? Though I can't recall
ever seeing anything like this.

Thanks,

Dejan

> I didn't realise they were renamed to cluster-libs on
> clusterlabs.org.
> Hmm. I can update but can the packages on clusterlabs still use heartbeat or 
> will I need to switch to corosync?
> 
> The log from pacemaker / heartbeat is:
> /var/log/ha-debug
> Jul 09 15:39:47 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:184: start
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush 
> message from fmp-dun-tapp2
> Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_start_0 (call=184, rc=0, cib-update=208, 
> confirmed=true) ok
> Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing 
> key=24:225:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 
> op=tomcat_tc1:0_monitor_10000 )
> Jul 09 15:39:51 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:185: 
> monitor
> Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_monitor_10000 (call=185, rc=0, cib-update=209, 
> confirmed=false) ok
> Jul 09 15:40:16 fmp-dun-tapp1 lrmd: [4264]: info: rsc:cmon_html0:0:174: 
> monitor
> Jul 09 15:42:12 fmp-dun-tapp1 cib: [4263]: info: cib_stats: Processed 198 
> operations (909.00us average, 0% utilization) in the last 10min
> Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: cancel_op: operation 
> monitor[185] on ocf::tomcat::tomcat_tc1:0 for client 4267, its parameters: 
> CRM_meta_interval=[10000] catalina_home=[/opt/tomcat] 
> catalina_base=[/home/tomcat/tc-1] tomcat_user=[tomcat] 
> catalina_pid=[/home/tomcat/tc-1/temp/tomcat.pid] catalina_rotate_log=[YES] 
> CRM_meta_timeout=[30000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] 
> java_home=[/usr/lib/jvm/java] CRM_meta_globally_unique=[false] 
> CRM_meta_name=[monitor] script_log=[/home/tomcat/tc-1/logs/tc-1.log] 
> statusurl=[http://127.0.0.1:10305/exam cancelled
> Jul 09 15:47:43 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing 
> key=24:226:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_stop_0 )
> Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:186: stop
> Jul 09 15:47:43 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_monitor_10000 (call=185, status=1, cib-update=0, 
> confirmed=true) Cancelled
> Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: write_cib_contents: 
> Archived previous version as /var/lib/heartbeat/crm/cib-11.raw
> Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: write_cib_contents: Wrote 
> version 0.259.0 of the CIB to disk (digest: e028d9e440a93208328ecb4eada8fdf6)
> Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: retrieveCib: Reading 
> cluster configuration from: /var/lib/heartbeat/crm/cib.8JcqTH (digest: 
> /var/lib/heartbeat/crm/cib.V73jBK)
> Jul 09 15:47:49 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_stop_0 (call=186, rc=0, cib-update=210, 
> confirmed=true) ok
> Jul 09 15:47:50 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing 
> key=23:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_start_0 )
> Jul 09 15:47:50 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:187: start
> Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_start_0 (call=187, rc=0, cib-update=211, 
> confirmed=true) ok
> Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing 
> key=24:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 
> op=tomcat_tc1:0_monitor_10000 )
> Jul 09 15:47:54 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:188: 
> monitor
> Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM 
> operation tomcat_tc1:0_monitor_10000 (call=188, rc=0, cib-update=212, 
> confirmed=false) ok
> 
> Process list is:
> root     23162     1  0 15:39 ?        00:00:00   /bin/sh 
> /usr/lib/ocf/resource.d//intact/tomcat start
> root     31372     1  0 15:47 ?        00:00:00   /bin/sh 
> /usr/lib/ocf/resource.d//intact/tomcat start
> tomcat   31408     1  0 15:47 ?        00:00:03   /usr/lib/jvm/java/bin/java 
> -Djava.util.logging.config.file=/home/tomcat/tc-1/conf/logging.properties
> ... snip ..
> org.apache.catalina.startup.Bootstrap start
> 
> There was a 'crm resource restart cl_tomcat_tc1' issued at 15:47.
> 
> In the above ha-debug log you can clearly see the lrmd starting
> tomcat at both points in time (15:39 and 15:47) and receiving a
> successful start ok response.
> 
> 
> 
> -----Original Message-----
> From: Dejan Muhamedagic [mailto:[email protected]]
> Sent: Fri 09/07/2010 14:46
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Tomcat Resource Agent always leaves dead process on 
> stop or restart
>  
> Hi,
> 
> On Fri, Jul 09, 2010 at 01:04:09PM +0100, Brett Delle Grazie wrote:
> > Hi,
> > 
> > Yes checked both logs:
> > 
> > Catalina.out specifies normal (successful) Tomcat startup.
> > 
> > tc-1.log (log from backgrounded start/stop operations):
> > 
> > Doesn't give anything unusual:
> > 2010/07/09 09:42:13: start ===========================
> > 2010/07/09 10:20:46: stop  ###########################
> > 2010/07/09 10:27:35: start ===========================
> > 2010/07/09 12:50:20: stop  ###########################
> > 2010/07/09 12:50:26: start ===========================
> > 
> > Yes, I realise these are from later runs but the same thing is still 
> > occurring.
> > 
> > Is it possible that the start operation doesn't close of one of
> > the file descriptors and is left 'hanging' - even though
> > it exits (at least from the perspective of pacemaker)?
> > 
> > Would this explain the ownership of 'init' by the 'tomcat
> > start' process instead of by pacemaker?
> 
> No. lrmd kills the process if it doesn't exit within the timeout.
> By "ownership" I guess you mean the parent process. The RA
> process (/usr/lib/ocf/.../tomcat start) is a child of the lrmd.
> init can become its parent only if lrmd exits.
> 
> What is the timeout for that start operation set to? Does the
> process remain even after that timeout? What happens to lrmd?
> 
> > > > > heartbeat-libs-3.0.3-1
> 
> Where does that come from? Normally, you should have
> cluster-libs. Perhaps you need to update.
> 
> Thanks,
> 
> Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to