On Thu, 16 Dec 2010 08:27:51 +0100, Andrew Beekhof wrote:

> On Wed, Dec 15, 2010 at 8:30 AM, Chris Picton wrote:
>> Why would a resource cleanup remove the resource from the lrm, even
>> though it is still running correctly,
>
> That's what cleanup does.
> What is supposed to happen next however, is that the cluster runs a
> non-recurring monitor operation to re-determine the current state of the
> cluster and go from there.
> Also, any recurring actions should have been cancelled at the point the
> resource was removed from the lrm.
>
> What versions of pacemaker and cluster-glue do you have? Distro?
>

I am using the clusterlabs rpms:

pacemaker-1.0.9.1-1.15.el5
cluster-glue-1.0.6-1.6.el5

I see the following in the output of crm_mon -rf1t (I'm only showing the
resources which are showing rc != 0):

* Node sbc-tpna2-06.ecntelecoms.za.net:  pingd=100
   megaswitch:5: migration-threshold=1000000
    + (53) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (55) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
    + (56) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
    + (57) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms rc=0 (ok)
* Node sbc-tpna2-05.ecntelecoms.za.net:  pingd=100
   megaswitch:4: migration-threshold=1000000
    + (58) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (60) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
    + (61) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
    + (62) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms rc=0 (ok)
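
For anyone who wants to pick the non-zero entries out of a longer operation history, here is a rough sketch of how I would filter them. It assumes the text looks like the lines quoted above ("* Node ...", a resource line carrying migration-threshold, then "+ (call) op: ... rc=N (text)"). The regular expressions and the failed_ops helper are names made up for illustration, not anything shipped with pacemaker, and other versions may lay the output out differently.

```python
import re

# Assumed layout, taken from the output quoted in this mail:
#   * Node <name>:  pingd=100
#      <resource>: migration-threshold=...
#       + (<call>) <op>: ... rc=<N> (<text>)
NODE_LINE = re.compile(r"\*\s+Node\s+(?P<node>\S+):")
RSC_LINE = re.compile(r"^\s*(?P<rsc>\S+):\s+migration-threshold=")
OP_LINE = re.compile(
    r"\+\s+\((?P<call>\d+)\)\s+(?P<op>[\w-]+):.*?\brc=(?P<rc>\d+)\s+\((?P<text>[^)]*)\)"
)


def failed_ops(report):
    """Return (node, resource, call, op, rc, text) for each op with rc != 0."""
    node = rsc = None
    found = []
    for line in report.splitlines():
        m = NODE_LINE.search(line)
        if m:
            node = m.group("node")
            continue
        m = RSC_LINE.match(line)
        if m:
            rsc = m.group("rsc")
            continue
        m = OP_LINE.search(line)
        if m and int(m.group("rc")) != 0:
            found.append(
                (node, rsc, int(m.group("call")), m.group("op"),
                 int(m.group("rc")), m.group("text"))
            )
    return found


# Two of the lines from the output above, used as sample input.
SAMPLE = """\
* Node sbc-tpna2-06.ecntelecoms.za.net:  pingd=100
   megaswitch:5: migration-threshold=1000000
    + (53) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (55) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
"""

if __name__ == "__main__":
    for entry in failed_ops(SAMPLE):
        print(entry)
```

On the sample it should report only the probe (call 53) and skip the stop, which is how I arrived at the shortened listing above.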

Would this affect the result of the 'non-recurring monitor operation'
(the probe operations having rc=1)?

I am not 100% sure why the errors are there - the log on the server for
that day shows:

----
Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: do_lrm_rsc_op: Performing key=36:2184:7:c83a06e0-913e-4546-92e5-19f784dcaf5c op=megaswitch:5_monitor_0 )
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: info: rsc:megaswitch:5:53: probe
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed megaswitch:5:monitor process 24823 exited with return code 1.
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed megaswitch:5:monitor process 24823 exited with return code 1.
Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: process_lrm_event: LRM operation megaswitch:5_monitor_0 (call=53, rc=1, cib-update=68, confirmed=true) unknown error
----

If they are affecting it, how would I clear them, so pacemaker sees
everything as OK?

Thanks for the help

Chris

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker