On Thu, Aug 16, 2012 at 9:15 AM, Jane Du (jadu) <[email protected]> wrote:
> Hi Andrew,
>
> Can you please let me know what the next version number is?
1.1.8

> What is the link where I can find the release information?

It hasn't been released yet.

> Thanks,
> Jane
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:[email protected]]
> Sent: Wednesday, August 15, 2012 3:02 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>
> On Wed, Aug 15, 2012 at 3:01 PM, Jane Du (jadu) <[email protected]> wrote:
>> Hi Andrew:
>>
>> The fail-count is not cleaned up in my setup after 2s.
>
> Based on some more information from David, it appears that there was one more bug lurking that prevented cleanup.
> This has now been fixed for the next version.
>
>> This is a two-node configuration and only one node is brought up.
>> When the resource is killed, I expect it to come up again after 2s when no peer is available.
>> The configuration and log follow. Please shed some light.
>>
>> crm_mon -f
>> ============
>> Last updated: Tue Aug 14 21:13:42 2012
>> Stack: Heartbeat
>> Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition with quorum
>> Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> 1 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ vm-mgmt-1 ]
>>
>> Migration summary:
>> * Node vm-mgmt-1:
>>    mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14 21:13:37 2012'
>>
>> crm configure show:
>> node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
>> primitive mhm_ra ocf:heartbeat:anything \
>>         params binfile="mhm_ra.bin" workdir="/usr/bin/" pidfile="/var/run/mhm_ra.pid" \
>>         op monitor on-fail="restart" interval="1s"
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>         cluster-infrastructure="Heartbeat" \
>>         stonith-enabled="false"
>> rsc_defaults $id="rsc-options" \
>>         resource-stickiness="100" \
>>         migration-threshold="1" \
>>         failure-timeout="2s"
>>
>> Kill mhm_ra, then after two seconds:
>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status name=fail-count-mhm_ra value=1
>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status name=fail-count-mhm_ra value=1
>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status name=fail-count-mhm_ra value=1
>> [root@vm-mgmt-1 ~]#
>> [root@vm-mgmt-1 ~]# date
>> Tue Aug 14 21:17:09 PDT 2012
>> [root@vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status name=fail-count-mhm_ra value=1
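In the meantime, a fail-count that refuses to expire can be cleared by hand. A minimal sketch, assuming the crm shell is installed on this node and that crm_failcount in this release accepts -D to delete the attribute (the counterpart of the -G query used above):

# clean up mhm_ra's failed-operation history and fail-count (crm shell):
crm resource cleanup mhm_ra

# or delete just the fail-count attribute for mhm_ra on vm-mgmt-1:
crm_failcount -D -U vm-mgmt-1 -r mhm_ra

Either form should clear the value that crm_mon -f reports under "Migration summary".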
>> Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51, confirmed=false) unknown error
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event: Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=mhm_ra_monitor_1000, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old event
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1 (update=value++, time=1345004017)
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for fail-count-mhm_ra
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback: Expanded fail-count-mhm_ra=value++ to 1
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mhm_ra (1)
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 31: fail-count-mhm_ra=1
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for last-failure-mhm_ra
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 34: last-failure-mhm_ra=1345004017
>> Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA, cib=0.7.27) : Transient attribute: update
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra, magic=NA, cib=0.7.28) : Transient attribute: update
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra (ocf::heartbeat:anything): Started vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave mhm_ra (Started vm-mgmt-1)
>> Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc calculation pe_calc-dc-1345004018-21 is obsolete
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra (ocf::heartbeat:anything): Started vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop mhm_ra (vm-mgmt-1) ***** When the peer is not available, stickiness should not take effect, right? Jane *****
>
> No.
>
>> Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry ***** Why does this keep getting logged when stonith is not enabled? Jane *****
>
> Because something is very very wrong. The process should be there even if stonith is not enabled.
>
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked transition 4: 2 actions in 2 synapses
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1345004019-22) derived from /var/lib/pengine/pe-input-4.bz2
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>
>> Thanks,
>>
>> Jane
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:[email protected]]
>> Sent: Monday, August 13, 2012 7:48 PM
>> To: [email protected]; The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>
>> On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <[email protected]> wrote:
>>> Hi David,
>>>
>>> I understand the failcount only gets reset after a probe is run. So you need to give it a "crm resource reprobe" for the expiry timer to be evaluated.
>>
>> Not for recent versions of 1.1
>>
>>> However: I do NOT know when the probes are run (I see them in my logs only after failover or start/stop actions take place, but not on a regular basis).
>>
>> When the machine first comes up, and when the admin calls "crm resource reprobe"
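Probes aside, expired failures are only re-evaluated when the policy engine runs, and on an otherwise quiet cluster that happens once per cluster-recheck-interval (15 minutes by default). A minimal sketch, assuming the crm shell, of shortening that interval so failure-timeout expiry is noticed promptly without scripting reprobes:

# re-run the policy engine at least once a minute, so expired
# fail-counts are re-evaluated even when nothing else changes:
crm configure property cluster-recheck-interval="60s"

Whether the stale fail-count attribute itself is then removed still depends on the fix mentioned above; the shorter interval only guarantees that the expiry is re-checked regularly.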
>>
>>> So one might have to schedule the crm resource reprobe as a cron job, or define it as its own resource?
>>>
>>> Cheers!
>>>
>>> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>>>> I'm running RHEL6 with the tech preview of Pacemaker it ships with. I have a number of resources with failure-timeout="60", which most of the time does what it is supposed to.
>>>>
>>>> Last night a resource failed, which was part of a clone. While the resource recovered, the fail-count entry never got cleaned up. Around every second the DC logged the pengine message below. I manually did a resource cleanup, and it seems happy now. Is there something I should be looking for in the logs to indicate that it 'missed' expiring this?
>>>>
>>>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>
>>>> Migration summary:
>>>> * Node dresproddns01:
>>>>    re-openfire-lsb:0: migration-threshold=1000000 fail-count=1 last-failure='Mon Jul 30 21:57:53 2012'
>>>> * Node dresproddns02:
>>>>
>>>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount: Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
