On 07/05/2013, at 5:15 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
> Hi,
>
> I only keep a couple of pe-input files, and that pe-input-1 version was
> already overwritten.
> I redid my tests as described in my previous mails.
>
> At the end of the test it was again written to pe-input-1, which is included
> as an attachment.

Perfect. Basically the PE doesn't know how to correctly recognise that
d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:

  <lrm_rsc_op id="d_tomcat_monitor_15000" operation_key="d_tomcat_monitor_15000"
      operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
      transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
      transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
      call-id="44" rc-code="0" op-status="0" interval="15000"
      last-rc-change="1367910303" exec-time="0" queue-time="0"
      op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>

  <lrm_rsc_op id="d_tomcat_last_failure_0" operation_key="d_tomcat_monitor_15000"
      operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
      transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
      transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
      call-id="44" rc-code="1" op-status="0" interval="15000"
      last-rc-change="1367909258" exec-time="0" queue-time="0"
      op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>

which would allow it to recognise that the resource is healthy once again.
I'll see what I can do...

> gr.
> Johan
>
> On 2013-05-07 04:08, Andrew Beekhof wrote:
>> I have a much clearer idea of the problem you're seeing now, thank you.
>>
>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
>>
>> On 03/05/2013, at 10:40 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>
>>> Hi,
>>>
>>> Below you can see my setup and my test. This shows that my cloned resource
>>> with on-fail=block does not recover automatically.
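[Editor's note] The ordering problem above can be sketched in miniature. This is an illustrative sketch only, not Pacemaker's actual code: both history entries carry call-id 44, so sorting by call-id alone cannot decide which one reflects the resource's current state; using last-rc-change as a tie-breaker would let the later, successful monitor supersede the recorded failure.

```python
# Illustrative sketch -- NOT Pacemaker's implementation. The two lrm_rsc_op
# entries quoted above share call-id 44, so call-id alone cannot order them.
# Breaking the tie on last-rc-change lets the later, successful monitor
# (rc-code 0) win over the older failure (rc-code 1).

ops = [
    {"id": "d_tomcat_monitor_15000", "call_id": 44,
     "rc_code": 0, "last_rc_change": 1367910303},
    {"id": "d_tomcat_last_failure_0", "call_id": 44,
     "rc_code": 1, "last_rc_change": 1367909258},
]

def newest_op(history):
    """Pick the operation that should define current resource state:
    highest call-id first, latest last-rc-change as tie-breaker."""
    return max(history, key=lambda op: (op["call_id"], op["last_rc_change"]))

current = newest_op(ops)
print(current["id"], current["rc_code"])  # d_tomcat_monitor_15000 0
```

With that ordering, the successful monitor is the most recent word on the resource, so it would be considered healthy again.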
>>>
>>> My Setup:
>>>
>>> # rpm -aq | grep -i pacemaker
>>> pacemaker-libs-1.1.9-1512.el6.i686
>>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>>> pacemaker-cli-1.1.9-1512.el6.i686
>>> pacemaker-1.1.9-1512.el6.i686
>>>
>>> # crm configure show
>>> node CSE-1
>>> node CSE-2
>>> primitive d_tomcat ocf:ntc:tomcat \
>>>     op monitor interval="15s" timeout="510s" on-fail="block" \
>>>     op start interval="0" timeout="510s" \
>>>     params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>     meta migration-threshold="1"
>>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>>     op monitor interval="10s" \
>>>     params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" iflabel="ha" \
>>>     meta migration-threshold="1" failure-timeout="10"
>>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>>     op monitor interval="10s" \
>>>     params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" iflabel="ha" \
>>>     meta migration-threshold="1" failure-timeout="10"
>>> group svc-cse ip_19 ip_11
>>> clone cl_tomcat d_tomcat
>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>> order order_tomcat inf: cl_tomcat svc-cse
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.1.9-1512.el6-2a917dd" \
>>>     cluster-infrastructure="cman" \
>>>     pe-warn-series-max="9" \
>>>     no-quorum-policy="ignore" \
>>>     stonith-enabled="false" \
>>>     pe-input-series-max="9" \
>>>     pe-error-series-max="9" \
>>>     last-lrm-refresh="1367582088"
>>>
>>> Currently only 1 node is available, CSE-1.
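[Editor's note] The constraint pair in this config (`colocation colo_tomcat inf: svc-cse cl_tomcat` plus the ordering) means the IP group may only run on a node hosting a healthy tomcat clone instance. A minimal sketch of that placement rule, illustrative only and not Pacemaker's actual algorithm:

```python
# Illustrative sketch -- not Pacemaker's placement code. It models what an
# infinite-score colocation of svc-cse with cl_tomcat implies: the group is
# only eligible on nodes where a clone instance is running and healthy.

def allowed_nodes(clone_state):
    """Nodes eligible to host svc-cse, given each node's cl_tomcat state."""
    return [node for node, state in clone_state.items() if state == "started"]

# Healthy clone on CSE-1 -> the failover IPs can run there.
print(allowed_nodes({"CSE-1": "started", "CSE-2": "stopped"}))  # ['CSE-1']

# Blocked/failed clone (on-fail=block) -> nowhere eligible, the IPs stop.
print(allowed_nodes({"CSE-1": "failed", "CSE-2": "stopped"}))   # []
```

This is why, in the test below, both IPaddr2 resources go to Stopped as soon as the only available clone instance is marked FAILED.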
>>>
>>>
>>> This is how I am currently testing my setup:
>>>
>>> => Starting point: Everything up and running
>>>
>>> # crm resource status
>>> Resource Group: svc-cse
>>>     ip_19 (ocf::heartbeat:IPaddr2): Started
>>>     ip_11 (ocf::heartbeat:IPaddr2): Started
>>> Clone Set: cl_tomcat [d_tomcat]
>>>     Started: [ CSE-1 ]
>>>     Stopped: [ d_tomcat:1 ]
>>>
>>> => Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log)
>>>
>>> # crm resource status
>>> Resource Group: svc-cse
>>>     ip_19 (ocf::heartbeat:IPaddr2): Stopped
>>>     ip_11 (ocf::heartbeat:IPaddr2): Stopped
>>> Clone Set: cl_tomcat [d_tomcat]
>>>     d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
>>>     Stopped: [ d_tomcat:1 ]
>>>
>>> => Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log)
>>>
>>> # crm resource status
>>> Resource Group: svc-cse
>>>     ip_19 (ocf::heartbeat:IPaddr2): Stopped
>>>     ip_11 (ocf::heartbeat:IPaddr2): Stopped
>>> Clone Set: cl_tomcat [d_tomcat]
>>>     d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
>>>     Stopped: [ d_tomcat:1 ]
>>>
>>> As you can see in the logs, the OCF script no longer returns any failure. Pacemaker
>>> notices this, but it isn't reflected in crm_mon and the dependent resources
>>> aren't started.
>>>
>>> Gr.
>>> Johan
>>>
>>> On 2013-05-03 03:04, Andrew Beekhof wrote:
>>>> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>>>
>>>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>>>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm trying to set up a specific configuration in our cluster, however
>>>>>>> I'm struggling with my configuration.
>>>>>>>
>>>>>>> This is what I'm trying to achieve:
>>>>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>>>>> Some failover addresses are configured and must be running on the node
>>>>>>> with a correctly running tomcat.
>>>>>>>
>>>>>>> I have achieved this with a cloned tomcat resource and a colocation
>>>>>>> between the cloned tomcat and the failover addresses.
>>>>>>> When I cause a failure in the tomcat on the node running the failover
>>>>>>> addresses, the failover addresses fail over to the other node as
>>>>>>> expected.
>>>>>>> crm_mon shows that this tomcat has a failure.
>>>>>>> When I configure the tomcat resource with failure-timeout=0, the
>>>>>>> failure alarm in crm_mon isn't cleared when the tomcat failure is
>>>>>>> fixed.
>>>>>> All sounds right so far.
>>>>> If my broken tomcat is automatically fixed, I expect this to be noticed
>>>>> by pacemaker and that node to become able to run my failover addresses again;
>>>>> however, I don't see this happening.
>>>> This is very hard to discuss without seeing logs.
>>>>
>>>> So you created a tomcat error, waited for pacemaker to notice, fixed the
>>>> error and observed that pacemaker did not re-notice?
>>>> How long did you wait? More than the 15s repeat interval, I assume? Did at
>>>> least the resource agent notice?
>>>>
>>>>>>> When I configure the tomcat resource with failure-timeout=30, the
>>>>>>> failure alarm in crm_mon is cleared after 30 seconds, however the tomcat
>>>>>>> is still having a failure.
>>>>>> Can you define "still having a failure"?
>>>>>> You mean it still shows up in crm_mon?
>>>>>> Have you read this link?
>>>>>>
>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
>>>>> "Still having a failure" means that the tomcat is still broken and my OCF
>>>>> script reports it as a failure.
>>>>>>> What I expect is that pacemaker reports the failure for as long as it
>>>>>>> exists, and reports that everything is ok once everything is back ok.
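[Editor's note] The behaviour Johan sees with failure-timeout=30 matches that option's documented semantics: the recorded failure expires once the timeout has elapsed, whether or not the underlying fault was actually fixed, and expiry is only evaluated at a cluster recheck (hence the cluster-recheck-interval link above). A minimal sketch of that semantics, illustrative only and not Pacemaker code:

```python
# Illustrative sketch of failure-timeout semantics -- NOT Pacemaker code.
# The recorded failure is forgotten once failure-timeout seconds have
# passed since it was observed, regardless of whether the fault remains.
# That is why crm_mon's alarm clears after 30s while tomcat is still broken.

def failure_expired(last_failure, failure_timeout, now):
    """True once the failure is old enough to be forgotten."""
    if failure_timeout <= 0:   # failure-timeout=0 disables automatic expiry
        return False
    return (now - last_failure) >= failure_timeout

t0 = 1367909258                            # time of the monitor failure
print(failure_expired(t0, 30, t0 + 10))    # False: still within 30s
print(failure_expired(t0, 30, t0 + 45))    # True: alarm clears, fault or not
print(failure_expired(t0, 0, t0 + 999))    # False: never expires automatically
```

Note the expiry is purely time-based: nothing here consults the resource's actual health, which only the next monitor result can report.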
>>>>>>>
>>>>>>> Am I doing something wrong in my configuration?
>>>>>>> Or how can I achieve my desired setup?
>>>>>>>
>>>>>>> Here is my configuration:
>>>>>>>
>>>>>>> node CSE-1
>>>>>>> node CSE-2
>>>>>>> primitive d_tomcat ocf:custom:tomcat \
>>>>>>>     op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>>     op start interval="0" timeout="510s" \
>>>>>>>     params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>>     meta migration-threshold="1" failure-timeout="0"
>>>>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>>>>>     op monitor interval="10s" \
>>>>>>>     params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>>>>>     op monitor interval="10s" \
>>>>>>>     params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>>>>> group svc-cse ip_1 ip_2
>>>>>>> clone cl_tomcat d_tomcat
>>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>     dc-version="1.1.8-7.el6-394e906" \
>>>>>>>     cluster-infrastructure="cman" \
>>>>>>>     no-quorum-policy="ignore" \
>>>>>>>     stonith-enabled="false"
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Johan Huysmans
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>> <step_2.log><step_3.log>
> <pe-input-1.bz2>