On Thu, Aug 16, 2012 at 6:20 PM, Jane Du (jadu) <[email protected]> wrote:
> Hi Andrew,
>
> Can you please help explain these logs:
>
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra
> (ocf::heartbeat:anything): Started vm-mgmt-1
> WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1
> failures (max=1)
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop
>>> mhm_ra (vm-mgmt-1)
>
> Although the failure count has not expired, it seems that the resource got
> restarted after 2s, but it is stopped because of the configured stickiness.
No, unless it failed with 7 (OCF_NOT_RUNNING), failed resources are considered
running until a successful stop operation.

> When the peer is not available, stickiness should not take effect, right?
>
> Thanks,
>
> Jane
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:[email protected]]
> Sent: Wednesday, August 15, 2012 4:26 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>
> On Thu, Aug 16, 2012 at 9:15 AM, Jane Du (jadu) <[email protected]> wrote:
>> Hi Andrew,
>>
>> Can you please let me know what the next version number is?
>
> 1.1.8
>
>> What is the link where I can find the release information?
>
> It hasn't been released yet.
>
>> Thanks,
>> Jane
>>
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:[email protected]]
>> Sent: Wednesday, August 15, 2012 3:02 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>
>> On Wed, Aug 15, 2012 at 3:01 PM, Jane Du (jadu) <[email protected]> wrote:
>>> Hi Andrew:
>>>
>>> The fail-count is not cleaned up in my setup after 2s.
>>
>> Based on some more information from David, it appears that there was
>> one more bug lurking that prevented cleanup.
>> This has now been fixed for the next version.
>>
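Until a build with that fix is available, the stuck counter can also be
cleared by hand. A sketch, using the same node and resource names as in the
configuration quoted below (either command should do):

  # delete the fail-count attribute for mhm_ra on vm-mgmt-1
  crm_failcount -D -U vm-mgmt-1 -r mhm_ra

  # or clear the resource's failure state as a whole
  crm resource cleanup mhm_ra
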
>>> This is a two-node configuration and only one node is brought up.
>>> When the resource is killed, I expect it to come up again after 2s when no
>>> peer is available.
>>> The configuration and log follow. Please shed some light.
>>>
>>> crm_mon -f
>>> ============
>>> Last updated: Tue Aug 14 21:13:42 2012
>>> Stack: Heartbeat
>>> Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition
>>> with quorum
>>> Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>> 1 Nodes configured, unknown expected votes
>>> 1 Resources configured.
>>> ============
>>>
>>> Online: [ vm-mgmt-1 ]
>>> Migration summary:
>>> * Node vm-mgmt-1:
>>>    mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14
>>> 21:13:37 2012'
>>>
>>>
>>> crm configure show:
>>> node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
>>> primitive mhm_ra ocf:heartbeat:anything \
>>>     params binfile="mhm_ra.bin" workdir="/usr/bin/"
>>> pidfile="/var/run/mhm_ra.pid" \
>>>     op monitor on-fail="restart" interval="1s"
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>>     cluster-infrastructure="Heartbeat" \
>>>     stonith-enabled="false"
>>> rsc_defaults $id="rsc-options" \
>>>     resource-stickiness="100" \
>>>     migration-threshold="1" \
>>>     failure-timeout="2s"
>>>
>>> kill the mhm_ra, after two seconds:
>>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status name=fail-count-mhm_ra value=1
>>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status name=fail-count-mhm_ra value=1
>>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status name=fail-count-mhm_ra value=1
>>> [root@vm-mgmt-1 ~]#
>>> [root@vm-mgmt-1 ~]# date
>>> Tue Aug 14 21:17:09 PDT 2012
>>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>>
>>> scope=status name=fail-count-mhm_ra value=1
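The failed monitors in the log below come back as rc=1 (OCF_ERR_GENERIC)
rather than rc=7 (OCF_NOT_RUNNING), which is why Pacemaker keeps treating the
resource as running-but-failed until a stop succeeds. A minimal monitor
sketch -- not the actual ocf:heartbeat:anything agent, and the pidfile path is
only an assumption -- to illustrate the distinction:

  monitor() {
      pid=$(cat /var/run/mhm_ra.pid 2>/dev/null)
      if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
          return 0    # OCF_SUCCESS: the process is alive
      fi
      return 7        # OCF_NOT_RUNNING: cleanly stopped; any other non-zero
                      # value (e.g. 1, OCF_ERR_GENERIC) is a failure on a
                      # resource that is still considered active
  }
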
>>> Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output:
>>> (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line
>>> 60: kill: (24612) - No such process
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in
>>> failed: triggered a retry
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM
>>> operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51,
>>> confirmed=false) unknown error
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith:
>>> Attempting connection to fencing daemon...
>>> Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output:
>>> (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line
>>> 60: kill: (24612) - No such process
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in
>>> failed: triggered a retry
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event:
>>> Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph:
>>> process_graph_event:473 - Triggered transition abort (complete=1,
>>> tag=lrm_rsc_op, id=mhm_ra_monitor_1000,
>>> magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old
>>> event
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating
>>> failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1
>>> (update=value++, time=1345004017)
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating
>>> hash entry for fail-count-mhm_ra
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback:
>>> Expanded fail-count-mhm_ra=value++ to 1
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-mhm_ra (1)
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State
>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>>> origin=abort_transition_graph ]
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1
>>> cluster nodes are eligible to run resources.
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52:
>>> Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith:
>>> Attempting connection to fencing daemon...
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent
>>> update 31: fail-count-mhm_ra=1
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating
>>> hash entry for last-failure-mhm_ra
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent
>>> update 34: last-failure-mhm_ra=1345004017
>>> Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output:
>>> (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line
>>> 60: kill: (24612) - No such process
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in
>>> failed: triggered a retry
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback:
>>> Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph:
>>> te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair,
>>> id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA,
>>> cib=0.7.27) : Transient attribute: update
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph:
>>> te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair,
>>> id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra,
>>> magic=NA, cib=0.7.28) : Transient attribute: update
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53:
>>> Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54:
>>> Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op:
>>> Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring
>>> expired failure mhm_ra_monitor_1000 (rc=1,
>>> magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith:
>>> Attempting connection to fencing daemon...
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra
>>> (ocf::heartbeat:anything): Started vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave
>>> mhm_ra (Started vm-mgmt-1)
>>> Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output:
>>> (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line
>>> 60: kill: (24612) - No such process
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in
>>> failed: triggered a retry
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback:
>>> Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc
>>> calculation pe_calc-dc-1345004018-21 is obsolete
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith:
>>> Attempting connection to fencing daemon...
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op:
>>> Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring
>>> expired failure mhm_ra_monitor_1000 (rc=1,
>>> magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra
>>> (ocf::heartbeat:anything): Started vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness:
>>> Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop
>>> mhm_ra (vm-mgmt-1) ***** when the peer is not available, stickiness should
>>> not take effect, right? Jane ***
>>
>> No.
>>
>>> Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output:
>>> (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line
>>> 60: kill: (24612) - No such process
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in
>>> failed: triggered a retry ****** why does this keep getting logged when
>>> stonith is not enabled? Jane ***
>>
>> Because something is very very wrong. The process should be there
>> even if stonith is not enabled.
>>
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State
>>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>>> cause=C_IPC_MESSAGE origin=handle_response ]
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked
>>> transition 4: 2 actions in 2 synapses
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing
>>> graph 4 (ref=pe_calc-dc-1345004019-22) derived from
>>> /var/lib/pengine/pe-input-4.bz2
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith:
>>> Attempting connection to fencing daemon...
>>>
>>>
>>> Thanks,
>>>
>>> Jane
>>>
>>> -----Original Message-----
>>> From: Andrew Beekhof [mailto:[email protected]]
>>> Sent: Monday, August 13, 2012 7:48 PM
>>> To: [email protected]; The Pacemaker cluster resource manager
>>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>>
>>> On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <[email protected]>
>>> wrote:
>>>> Hi David,
>>>>
>>>> I understand the failcount only gets reset after a probe is run, so you
>>>> need to give it a "crm resource reprobe" for the expiry timer to be
>>>> evaluated.
>>>
>>> Not for recent versions of 1.1.
>>>
>>>>
>>>> However, I do NOT know when the probes are run (I see them in my logs
>>>> only after failover or start/stop actions take place, but not on a
>>>> regular basis).
>>>
>>> When the machine first comes up, and when the admin calls "crm resource
>>> reprobe".
>>>
>>>> So one might have to schedule the crm resource reprobe as a cron job
>>>> or define it as its own resource?
>>>>
>>>> Cheers!
>>>>
>>>> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>>>>> I'm running RHEL6 with the tech preview of Pacemaker it ships with. I
>>>>> have a number of resources with failure-timeout="60", which most of
>>>>> the time does what it is supposed to.
>>>>>
>>>>> Last night a resource that was part of a clone failed. While the
>>>>> resource recovered, the fail-count never got cleaned up. Around
>>>>> every second the DC logged the pengine message below. I manually did a
>>>>> resource cleanup, and it seems happy now. Is there something I should be
>>>>> looking for in the logs to indicate that it 'missed' expiring this?
>>>>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>>
>>>>> Migration summary:
>>>>> * Node dresproddns01:
>>>>>    re-openfire-lsb:0: migration-threshold=1000000 fail-count=1
>>>>> last-failure='Mon Jul 30 21:57:53 2012'
>>>>> * Node dresproddns02:
>>>>>
>>>>>
>>>>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount:
>>>>> Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
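The expiry David is relying on is the failure-timeout resource meta attribute,
which the pengine line above shows being evaluated with a 60s limit. A hedged
crm configure sketch of that kind of setup -- the lsb agent name and monitor
interval are assumptions and this is not David's actual configuration; the
threshold and timeout values are taken from the summary above:

  primitive re-openfire-lsb lsb:openfire \
          op monitor interval="30s" \
          meta migration-threshold="1000000" failure-timeout="60s"
  clone cl-openfire re-openfire-lsb

On versions where the expiry is only re-evaluated at probe time (as Mario
describes), a probe can be forced by hand with:

  crm resource reprobe
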
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org