Hi, On Fri, Jul 09, 2010 at 04:01:14PM +0100, Brett Delle Grazie wrote: > Hi, > > Yes I meant parent of the process is init. > > Timeout for start operation is: 120 seconds - process is still > around (so is tomcat) after this.
According to these log messages: > Jul 09 15:39:47 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:184: start > Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_start_0 (call=184, rc=0, cib-update=208, > confirmed=true) ok > Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:186: stop > Jul 09 15:47:49 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_stop_0 (call=186, rc=0, cib-update=210, > confirmed=true) ok > Jul 09 15:47:50 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing > key=23:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_start_0 ) > Jul 09 15:47:50 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:187: start > Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_start_0 (call=187, rc=0, cib-update=211, > confirmed=true) ok everything seems to run fine. The start operation takes around 4 seconds. It seems to exit just fine. I really don't know what's happening. Can you run this again with debug turned on. Then produce a hb_report. If it's too big to send to the list, you can open a bugzilla. > heartbeat-libs - taken from linbit repo packages where we have > a support contract as we use DRBD in other stuff - OK. Perhaps you can upgrade from linbit? Though I can't recall ever seeing anything like this. Thanks, Dejan > I didn't realise they were renamed to cluster-libs on > clusterlabs.org. > Hmm. I can update but can the packages on clusterlabs still use heartbeat or > will I need to switch to corosync? > > The log from pacemaker / heartbeat is: > /var/log/ha-debug > Jul 09 15:39:47 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:184: start > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:47 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:48 fmp-dun-tapp1 attrd: [4266]: info: attrd_ha_callback: flush > message from fmp-dun-tapp2 > Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_start_0 (call=184, rc=0, cib-update=208, > confirmed=true) ok > Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing > key=24:225:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 > op=tomcat_tc1:0_monitor_10000 ) > Jul 09 15:39:51 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:185: > monitor > Jul 09 15:39:51 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_monitor_10000 (call=185, rc=0, cib-update=209, > confirmed=false) ok > Jul 09 15:40:16 fmp-dun-tapp1 lrmd: [4264]: info: rsc:cmon_html0:0:174: > monitor > Jul 09 15:42:12 fmp-dun-tapp1 cib: [4263]: info: cib_stats: Processed 198 > operations (909.00us average, 0% utilization) in the last 10min > Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: cancel_op: operation > monitor[185] on ocf::tomcat::tomcat_tc1:0 for client 4267, its parameters: > CRM_meta_interval=[10000] catalina_home=[/opt/tomcat] > catalina_base=[/home/tomcat/tc-1] tomcat_user=[tomcat] > catalina_pid=[/home/tomcat/tc-1/temp/tomcat.pid] catalina_rotate_log=[YES] > CRM_meta_timeout=[30000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] > java_home=[/usr/lib/jvm/java] CRM_meta_globally_unique=[false] > CRM_meta_name=[monitor] script_log=[/home/tomcat/tc-1/logs/tc-1.log] > statusurl=[http://127.0.0.1:10305/exam cancelled > Jul 09 15:47:43 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing > key=24:226:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_stop_0 ) > Jul 09 15:47:43 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:186: stop > Jul 09 15:47:43 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_monitor_10000 (call=185, status=1, cib-update=0, > confirmed=true) Cancelled > Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: write_cib_contents: > Archived previous version as /var/lib/heartbeat/crm/cib-11.raw > Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: write_cib_contents: Wrote > version 0.259.0 of the CIB to disk (digest: e028d9e440a93208328ecb4eada8fdf6) > Jul 09 15:47:43 fmp-dun-tapp1 cib: [31183]: info: retrieveCib: Reading > cluster configuration from: /var/lib/heartbeat/crm/cib.8JcqTH (digest: > /var/lib/heartbeat/crm/cib.V73jBK) > Jul 09 15:47:49 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_stop_0 (call=186, rc=0, cib-update=210, > confirmed=true) ok > Jul 09 15:47:50 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing > key=23:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 op=tomcat_tc1:0_start_0 ) > Jul 09 15:47:50 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:187: start > Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_start_0 (call=187, rc=0, cib-update=211, > confirmed=true) ok > Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: do_lrm_rsc_op: Performing > key=24:227:0:2c2c0209-48a9-40d0-b4f9-53ea1adcd584 > op=tomcat_tc1:0_monitor_10000 ) > Jul 09 15:47:54 fmp-dun-tapp1 lrmd: [4264]: info: rsc:tomcat_tc1:0:188: > monitor > Jul 09 15:47:54 fmp-dun-tapp1 crmd: [4267]: info: process_lrm_event: LRM > operation tomcat_tc1:0_monitor_10000 (call=188, rc=0, cib-update=212, > confirmed=false) ok > > Process list is: > root 23162 1 0 15:39 ? 00:00:00 /bin/sh > /usr/lib/ocf/resource.d//intact/tomcat start > root 31372 1 0 15:47 ? 00:00:00 /bin/sh > /usr/lib/ocf/resource.d//intact/tomcat start > tomcat 31408 1 0 15:47 ? 00:00:03 /usr/lib/jvm/java/bin/java > -Djava.util.logging.config.file=/home/tomcat/tc-1/conf/logging.properties > ... snip .. > org.apache.catalina.startup.Bootstrap start > > There was a 'crm resource restart cl_tomcat_tc1' issued at 15:47. > > In the above ha-debug log you can clearly see the lrmd starting > tomcat at both points in time (15:39 and 15:47) and receiving a > successful start ok response. > > > > -----Original Message----- > From: Dejan Muhamedagic [mailto:[email protected]] > Sent: Fri 09/07/2010 14:46 > To: General Linux-HA mailing list > Subject: Re: [Linux-HA] Tomcat Resource Agent always leaves dead process on > stop or restart > > Hi, > > On Fri, Jul 09, 2010 at 01:04:09PM +0100, Brett Delle Grazie wrote: > > Hi, > > > > Yes checked both logs: > > > > Catalina.out specifies normal (successful) Tomcat startup. > > > > tc-1.log (log from backgrounded start/stop operations): > > > > Doesn't give anything unusual: > > 2010/07/09 09:42:13: start =========================== > > 2010/07/09 10:20:46: stop ########################### > > 2010/07/09 10:27:35: start =========================== > > 2010/07/09 12:50:20: stop ########################### > > 2010/07/09 12:50:26: start =========================== > > > > Yes, I realise these are from later runs but the same thing is still > > occurring. > > > > Is it possible that the start operation doesn't close of one of > > the file descriptors and is left 'hanging' - even though > > it exits (at least from the perspective of pacemaker)? > > > > Would this explain the ownership of 'init' by the 'tomcat > > start' process instead of by pacemaker? > > No. lrmd kills the process if it doesn't exit within the timeout. > By "ownership" I guess you mean the parent process. The RA > process (/usr/lib/ocf/.../tomcat start) is a child of the lrmd. > init can become its parent only if lrmd exits. > > What is the timeout for that start operation set to? Does the > process remain even after that timeout? What happens to lrmd? > > > > > > heartbeat-libs-3.0.3-1 > > Where does that come from? Normally, you should have > cluster-libs. Perhaps you need to update. > > Thanks, > > Dejan _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
