On 16/05/2013, at 12:45 AM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
> Hi Andrew,
>
> Thx!
>
> I tested your github pacemaker repository by building an rpm from it and
> installing it on my test setup.
>
> Before I could build the rpm I had to change 2 things in the GNUmakefile:
> * --without=doc should be --without doc

That would be a dependency issue, I suspect something needed was not installed.

> * --target i686 was missing
> If I didn't make these modifications the rpmbuild command failed (on CentOS 6)

What was the command you ran?

> I performed the test which failed before and everything seems OK.
> Once the failing resource was restored the dependent resources were
> automatically started.
>
> Thanks for this fast fix!
>
> In which release can I expect this fix? And when is it planned?

1.1.10, planned for as soon as all the bugs are fixed :)
We're at rc2 now; rc3 should be today/tomorrow.

> I will currently use the head build I created. This is ok for my test setup,
> but I don't want to run this version in production.
>
> Greetings,
> Johan Huysmans
>
> On 2013-05-10 06:55, Andrew Beekhof wrote:
>> Fixed!
>>
>> https://github.com/beekhof/pacemaker/commit/d87de1b
>>
>> On 10/05/2013, at 11:59 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>>
>>> On 07/05/2013, at 5:15 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>>
>>>> Hi,
>>>>
>>>> I only keep a couple of pe-input files, and that pe-input-1 version was
>>>> already overwritten.
>>>> I redid my tests as described in my previous mails.
>>>>
>>>> At the end of the test it was again written to pe-input-1, which is
>>>> included as attachment.
>>>
>>> Perfect.
>>> Basically the PE doesn't know how to correctly recognise that
>>> d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
>>>
>>>   <lrm_rsc_op id="d_tomcat_monitor_15000"
>>>       operation_key="d_tomcat_monitor_15000" operation="monitor"
>>>       crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>>>       transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>>>       transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>>>       call-id="44" rc-code="0" op-status="0" interval="15000"
>>>       last-rc-change="1367910303" exec-time="0" queue-time="0"
>>>       op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>>   <lrm_rsc_op id="d_tomcat_last_failure_0"
>>>       operation_key="d_tomcat_monitor_15000" operation="monitor"
>>>       crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>>>       transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>>>       transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>>>       call-id="44" rc-code="1" op-status="0" interval="15000"
>>>       last-rc-change="1367909258" exec-time="0" queue-time="0"
>>>       op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>>
>>> which would allow it to recognise that the resource is healthy once again.
>>>
>>> I'll see what I can do...
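
For anyone comparing the two entries above: the interesting differences are
rc-code (0 vs 1) and last-rc-change, and the healthy result is the newer of
the two by roughly 17 minutes. The epoch timestamps are easy to eyeball with
GNU date, for example:

  # date -d @1367909258   # d_tomcat_last_failure_0, rc-code=1 (the old failure)
  # date -d @1367910303   # d_tomcat_monitor_15000, rc-code=0 (the later, healthy monitor)

So ordering the operations on last-rc-change would show that the failure has
already been superseded.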
>>>>
>>>> gr.
>>>> Johan
>>>>
>>>> On 2013-05-07 04:08, Andrew Beekhof wrote:
>>>>> I have a much clearer idea of the problem you're seeing now, thank you.
>>>>>
>>>>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
>>>>>
>>>>> On 03/05/2013, at 10:40 PM, Johan Huysmans <johan.huysm...@inuits.be>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Below you can see my setup and my test, this shows that my cloned
>>>>>> resource with on-fail=block does not recover automatically.
>>>>>>
>>>>>> My Setup:
>>>>>>
>>>>>> # rpm -aq | grep -i pacemaker
>>>>>> pacemaker-libs-1.1.9-1512.el6.i686
>>>>>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>>>>>> pacemaker-cli-1.1.9-1512.el6.i686
>>>>>> pacemaker-1.1.9-1512.el6.i686
>>>>>>
>>>>>> # crm configure show
>>>>>> node CSE-1
>>>>>> node CSE-2
>>>>>> primitive d_tomcat ocf:ntc:tomcat \
>>>>>>     op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>     op start interval="0" timeout="510s" \
>>>>>>     params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>     meta migration-threshold="1"
>>>>>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>>>>>     op monitor interval="10s" \
>>>>>>     params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" iflabel="ha" \
>>>>>>     meta migration-threshold="1" failure-timeout="10"
>>>>>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>>>>>     op monitor interval="10s" \
>>>>>>     params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" iflabel="ha" \
>>>>>>     meta migration-threshold="1" failure-timeout="10"
>>>>>> group svc-cse ip_19 ip_11
>>>>>> clone cl_tomcat d_tomcat
>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>> property $id="cib-bootstrap-options" \
>>>>>>     dc-version="1.1.9-1512.el6-2a917dd" \
>>>>>>     cluster-infrastructure="cman" \
>>>>>>     pe-warn-series-max="9" \
>>>>>>     no-quorum-policy="ignore" \
>>>>>>     stonith-enabled="false" \
>>>>>>     pe-input-series-max="9" \
>>>>>>     pe-error-series-max="9" \
>>>>>>     last-lrm-refresh="1367582088"
>>>>>>
>>>>>> Currently only 1 node is available, CSE-1.
>>>>>>
>>>>>> This is how I am currently testing my setup:
>>>>>>
>>>>>> => Starting point: Everything up and running
>>>>>>
>>>>>> # crm resource status
>>>>>>  Resource Group: svc-cse
>>>>>>      ip_19  (ocf::heartbeat:IPaddr2): Started
>>>>>>      ip_11  (ocf::heartbeat:IPaddr2): Started
>>>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>>>      Started: [ CSE-1 ]
>>>>>>      Stopped: [ d_tomcat:1 ]
>>>>>>
>>>>>> => Causing failure: Change system so tomcat is running but has a failure
>>>>>> (in attachment step_2.log)
>>>>>>
>>>>>> # crm resource status
>>>>>>  Resource Group: svc-cse
>>>>>>      ip_19  (ocf::heartbeat:IPaddr2): Stopped
>>>>>>      ip_11  (ocf::heartbeat:IPaddr2): Stopped
>>>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>>>      d_tomcat:0  (ocf::ntc:tomcat): Started (unmanaged) FAILED
>>>>>>      Stopped: [ d_tomcat:1 ]
>>>>>>
>>>>>> => Fixing failure: Revert system so tomcat is running without failure
>>>>>> (in attachment step_3.log)
>>>>>>
>>>>>> # crm resource status
>>>>>>  Resource Group: svc-cse
>>>>>>      ip_19  (ocf::heartbeat:IPaddr2): Stopped
>>>>>>      ip_11  (ocf::heartbeat:IPaddr2): Stopped
>>>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>>>      d_tomcat:0  (ocf::ntc:tomcat): Started (unmanaged) FAILED
>>>>>>      Stopped: [ d_tomcat:1 ]
>>>>>>
>>>>>> As you can see in the logs the OCF script doesn't return any failure.
>>>>>> This is noticed by pacemaker, however it isn't reflected in crm_mon and
>>>>>> the dependent resources aren't started.
>>>>>>
>>>>>> Gr.
>>>>>> Johan
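
(Side note: until a fix is released, a blocked instance like this can
normally be recovered by hand once tomcat is healthy again. Cleaning up the
resource clears the recorded failure and triggers a fresh probe, after which
the dependent group should start; the resource name matches the
configuration above:

  # crm resource cleanup d_tomcat
  # crm resource status
)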
>>>>>>
>>>>>> On 2013-05-03 03:04, Andrew Beekhof wrote:
>>>>>>> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysm...@inuits.be>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>>>>>>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysm...@inuits.be>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I'm trying to set up a specific configuration in our cluster,
>>>>>>>>>> however I'm struggling with my configuration.
>>>>>>>>>>
>>>>>>>>>> This is what I'm trying to achieve:
>>>>>>>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>>>>>>>> Some failover addresses are configured and must be running on the
>>>>>>>>>> node with a correctly running tomcat.
>>>>>>>>>>
>>>>>>>>>> I have achieved this with a cloned tomcat resource and a colocation
>>>>>>>>>> between the cloned tomcat and the failover addresses.
>>>>>>>>>> When I cause a failure in the tomcat on the node running the
>>>>>>>>>> failover addresses, the failover addresses will fail over to the
>>>>>>>>>> other node as expected.
>>>>>>>>>> crm_mon shows that this tomcat has a failure.
>>>>>>>>>> When I configure the tomcat resource with failure-timeout=0, the
>>>>>>>>>> failure alarm in crm_mon isn't cleared whenever the tomcat failure
>>>>>>>>>> is fixed.
>>>>>>>>>
>>>>>>>>> All sounds right so far.
>>>>>>>>
>>>>>>>> If my broken tomcat is automatically fixed, I expect this to be
>>>>>>>> noticed by pacemaker and that the node will be able to run my
>>>>>>>> failover addresses; however I don't see this happening.
>>>>>>>
>>>>>>> This is very hard to discuss without seeing logs.
>>>>>>>
>>>>>>> So you created a tomcat error, waited for pacemaker to notice, fixed
>>>>>>> the error and observed that pacemaker did not re-notice?
>>>>>>> How long did you wait? More than the 15s repeat interval I assume?
>>>>>>> Did at least the resource agent notice?
>>>>>>>
>>>>>>>>>> When I configure the tomcat resource with failure-timeout=30, the
>>>>>>>>>> failure alarm in crm_mon is cleared after 30 seconds, however the
>>>>>>>>>> tomcat is still having a failure.
>>>>>>>>>
>>>>>>>>> Can you define "still having a failure"?
>>>>>>>>> You mean it still shows up in crm_mon?
>>>>>>>>> Have you read this link?
>>>>>>>>>
>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
>>>>>>>>
>>>>>>>> "Still having a failure" means that the tomcat is still broken and my
>>>>>>>> OCF script reports it as a failure.
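
One thing that often trips people up here: failure-timeout is not acted on
the moment it expires. It is only re-evaluated when the cluster reacts to an
event or when the cluster-recheck-interval timer fires (15 minutes by
default, per the link above), so an expired failure can linger in crm_mon
for a while. Lowering the interval makes the expiry visible sooner, for
example:

  # crm configure property cluster-recheck-interval="5min"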
>>>>>>>>>>
>>>>>>>>>> What I expect is that pacemaker reports the failure for as long as
>>>>>>>>>> it exists, and that pacemaker reports that everything is ok once
>>>>>>>>>> everything is back ok.
>>>>>>>>>>
>>>>>>>>>> Do I do something wrong with my configuration?
>>>>>>>>>> Or how can I achieve my wanted setup?
>>>>>>>>>>
>>>>>>>>>> Here is my configuration:
>>>>>>>>>>
>>>>>>>>>> node CSE-1
>>>>>>>>>> node CSE-2
>>>>>>>>>> primitive d_tomcat ocf:custom:tomcat \
>>>>>>>>>>     op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>>>>>     op start interval="0" timeout="510s" \
>>>>>>>>>>     params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>>>>>     meta migration-threshold="1" failure-timeout="0"
>>>>>>>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>>>>>>>>     op monitor interval="10s" \
>>>>>>>>>>     params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>>>>>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>>>>>>>>     op monitor interval="10s" \
>>>>>>>>>>     params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>>>>>>>> group svc-cse ip_1 ip_2
>>>>>>>>>> clone cl_tomcat d_tomcat
>>>>>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>>>     dc-version="1.1.8-7.el6-394e906" \
>>>>>>>>>>     cluster-infrastructure="cman" \
>>>>>>>>>>     no-quorum-policy="ignore" \
>>>>>>>>>>     stonith-enabled="false"
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>> Johan Huysmans
>>>>>>
>>>>>> <step_2.log><step_3.log>
>>>>
>>>> <pe-input-1.bz2>
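
(To check whether a failure has really been expired, rather than just aged
out of the display, the failcount can be queried directly; something like:

  # crm_failcount -G -r d_tomcat -N CSE-1

should report 0 once the failure has been cleared.)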

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org