Hi,
Below you can see my setup and my test; this shows that my cloned
resource with on-fail=block does not recover automatically.
My Setup:
# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686
# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
meta migration-threshold="1"
primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111"
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119"
iflabel="ha" \
meta migration-threshold="1" failure-timeout="10"
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.9-1512.el6-2a917dd" \
cluster-infrastructure="cman" \
pe-warn-series-max="9" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
pe-input-series-max="9" \
pe-error-series-max="9" \
last-lrm-refresh="1367582088"
Currently only one node, CSE-1, is available.
This is how I am testing my setup:
=> Starting point: Everything up and running
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Started
ip_11 (ocf::heartbeat:IPaddr2): Started
Clone Set: cl_tomcat [d_tomcat]
Started: [ CSE-1 ]
Stopped: [ d_tomcat:1 ]
=> Causing failure: Change the system so that tomcat is running but failing
(see attachment step_2.log)
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Stopped
ip_11 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
=> Fixing failure: Revert the system so that tomcat runs without failure
(see attachment step_3.log)
# crm resource status
Resource Group: svc-cse
ip_19 (ocf::heartbeat:IPaddr2): Stopped
ip_11 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_tomcat [d_tomcat]
d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
As you can see in the logs, the OCF script no longer returns a failure
once the system is reverted. Pacemaker notices this (the monitor comes
back with rc=0), but the recovery is not reflected in crm_mon and the
dependent resources are not started.
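As far as I understand, because the monitor is configured with
on-fail="block", the failed clone instance is left unmanaged and its
failcount is not cleared automatically, even once the agent reports
success again. A manual workaround sketch (assuming the stock
pacemaker/crmsh tools and the names from my config above):
# crm_mon -1 --failcounts
(this shows fail-count-d_tomcat=1 on CSE-1)
# crm resource cleanup d_tomcat
After the cleanup pacemaker re-probes the resource, and once the monitor
succeeds it should start the dependent svc-cse group again.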
Gr.
Johan
On 2013-05-03 03:04, Andrew Beekhof wrote:
On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
On 2013-05-01 05:48, Andrew Beekhof wrote:
On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
Hi All,
I'm trying to set up a specific configuration in our cluster; however, I'm
struggling with it.
This is what I'm trying to achieve:
On both nodes of the cluster a daemon (tomcat) must be running.
Some failover addresses are configured, and they must run on a node with a
correctly running tomcat.
I have achieved this with a cloned tomcat resource and a colocation constraint
between the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover
addresses, the addresses fail over to the other node as expected, and
crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure alarm
in crm_mon isn't cleared when the tomcat failure is fixed.
All sounds right so far.
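(Note that failure-timeout=0 disables expiry entirely; with that setting the
failcount can only be cleared manually, e.g. with something like:
# crm resource cleanup d_tomcat
)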
If my broken tomcat recovers automatically, I expect pacemaker to notice this
and for that node to become eligible again to run my failover addresses;
however, I don't see this happening.
This is very hard to discuss without seeing logs.
So you created a tomcat error, waited for pacemaker to notice, fixed the error,
and observed that pacemaker did not re-notice?
How long did you wait? More than the 15s repeat interval I assume? Did at
least the resource agent notice?
When I configure the tomcat resource with failure-timeout=30, the failure alarm
in crm_mon is cleared after 30 seconds, even though tomcat still has a
failure.
Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
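In short: failure-timeout expiry is only evaluated when the policy engine
runs, which by default happens at most every cluster-recheck-interval
(15 minutes). If you need the expiry to be noticed sooner, you can lower
that interval; a sketch with an arbitrary example value:
# crm configure property cluster-recheck-interval="1min"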
"Still having a failure" means that the tomcat is still broken and my OCF
script reports it as a failure.
What I expect is that pacemaker reports the failure as the failure exists and
as long as it exists and that pacemaker reports that everything is ok once
everything is back ok.
Am I doing something wrong in my configuration?
Or how else can I achieve the setup I want?
Here is my configuration:
node CSE-1
node CSE-2
primitive d_tomcat ocf:custom:tomcat \
op monitor interval="15s" timeout="510s" on-fail="block" \
op start interval="0" timeout="510s" \
params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health"
monitor_timeout="120" \
meta migration-threshold="1" failure-timeout="0"
primitive ip_1 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
primitive ip_2 ocf:heartbeat:IPaddr2 \
op monitor interval="10s" \
params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
group svc-cse ip_1 ip_2
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id="cib-bootstrap-options" \
dc-version="1.1.8-7.el6-394e906" \
cluster-infrastructure="cman" \
no-quorum-policy="ignore" \
stonith-enabled="false"
Thanks!
Greetings,
Johan Huysmans
Attached log excerpt:
May 3 12:01:30 CSE-1 tomcat(d_tomcat)[22367]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:01:31 CSE-1 crmd[4898]: notice: process_lrm_event: LRM operation d_tomcat_monitor_15000 (call=113, rc=1, cib-update=467, confirmed=false) unknown error
May 3 12:01:31 CSE-1 crmd[4898]: notice: process_lrm_event: CSE-1-d_tomcat_monitor_15000:113 [ /cse/health HTTP Failures: localhost localhost: Request failed: Can't connect to localhost:80 (connect: Connection refused) ]
May 3 12:01:31 CSE-1 crmd[4898]: warning: update_failcount: Updating failcount for d_tomcat on CSE-1 after failed monitor: rc=1 (update=value++, time=1367582491)
May 3 12:01:31 CSE-1 crmd[4898]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 3 12:01:31 CSE-1 attrd[4896]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-d_tomcat (1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: warning: unpack_rsc_op: Processing failed op monitor for d_tomcat:0 on CSE-1: unknown error (1)
May 3 12:01:31 CSE-1 attrd[4896]: notice: attrd_perform_update: Sent update 297: fail-count-d_tomcat=1
May 3 12:01:31 CSE-1 attrd[4896]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-d_tomcat (1367582491)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Restart ip_19#011(Started CSE-1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Restart ip_11#011(Started CSE-1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Start d_tomcat:1#011(CSE-1)
May 3 12:01:31 CSE-1 attrd[4896]: notice: attrd_perform_update: Sent update 299: last-failure-d_tomcat=1367582491
May 3 12:01:31 CSE-1 pengine[4897]: notice: process_pe_message: Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-7.bz2
May 3 12:01:31 CSE-1 pengine[4897]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: warning: unpack_rsc_op: Processing failed op monitor for d_tomcat:0 on CSE-1: unknown error (1)
May 3 12:01:31 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:01:31 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Stop ip_19#011(CSE-1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Stop ip_11#011(CSE-1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: process_pe_message: Calculated Transition 35: /var/lib/pacemaker/pengine/pe-input-8.bz2
May 3 12:01:31 CSE-1 pengine[4897]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:01:31 CSE-1 pengine[4897]: warning: unpack_rsc_op: Processing failed op monitor for d_tomcat:0 on CSE-1: unknown error (1)
May 3 12:01:31 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:01:31 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Stop ip_19#011(CSE-1)
May 3 12:01:31 CSE-1 pengine[4897]: notice: LogActions: Stop ip_11#011(CSE-1)
May 3 12:01:31 CSE-1 crmd[4898]: notice: te_rsc_command: Initiating action 12: stop ip_11_stop_0 on CSE-1 (local)
May 3 12:01:31 CSE-1 IPaddr2(ip_11)[22531]: INFO: IP status = ok, IP_CIP=
May 3 12:01:31 CSE-1 crmd[4898]: notice: process_lrm_event: LRM operation ip_11_stop_0 (call=159, rc=0, cib-update=472, confirmed=true) ok
May 3 12:01:31 CSE-1 crmd[4898]: notice: te_rsc_command: Initiating action 10: stop ip_19_stop_0 on CSE-1 (local)
May 3 12:01:31 CSE-1 IPaddr2(ip_19)[22663]: INFO: IP status = ok, IP_CIP=
May 3 12:01:31 CSE-1 crmd[4898]: notice: process_lrm_event: LRM operation ip_19_stop_0 (call=169, rc=0, cib-update=474, confirmed=true) ok
May 3 12:01:31 CSE-1 crmd[4898]: notice: run_graph: Transition 36 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-0.bz2): Complete
May 3 12:01:31 CSE-1 crmd[4898]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 3 12:01:46 CSE-1 tomcat(d_tomcat)[22707]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:02:01 CSE-1 tomcat(d_tomcat)[22841]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:02:16 CSE-1 tomcat(d_tomcat)[22968]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:02:31 CSE-1 tomcat(d_tomcat)[23100]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:02:46 CSE-1 tomcat(d_tomcat)[23227]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:03:01 CSE-1 tomcat(d_tomcat)[23359]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:03:17 CSE-1 tomcat(d_tomcat)[23511]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:03:32 CSE-1 tomcat(d_tomcat)[23643]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:03:47 CSE-1 tomcat(d_tomcat)[23770]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:04:02 CSE-1 tomcat(d_tomcat)[23903]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:04:17 CSE-1 tomcat(d_tomcat)[24031]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:13:38 CSE-1 tomcat(d_tomcat)[29336]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:13:53 CSE-1 tomcat(d_tomcat)[29471]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:14:08 CSE-1 tomcat(d_tomcat)[29598]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:14:23 CSE-1 tomcat(d_tomcat)[29730]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:14:38 CSE-1 tomcat(d_tomcat)[29857]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:14:53 CSE-1 tomcat(d_tomcat)[30014]: ERROR: TOMCAT is running, but healthpage for /cse/health failed
May 3 12:15:09 CSE-1 crmd[4898]: notice: process_lrm_event: LRM operation d_tomcat_monitor_15000 (call=113, rc=0, cib-update=475, confirmed=false) ok
May 3 12:15:09 CSE-1 crmd[4898]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 3 12:15:09 CSE-1 pengine[4897]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: crit: get_timet_now: Defaulting to 'now'
May 3 12:15:09 CSE-1 pengine[4897]: warning: unpack_rsc_op: Processing failed op monitor for d_tomcat:0 on CSE-1: unknown error (1)
May 3 12:15:09 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:15:09 CSE-1 pengine[4897]: warning: common_apply_stickiness: Forcing cl_tomcat away from CSE-1 after 1 failures (max=1)
May 3 12:15:09 CSE-1 crmd[4898]: notice: run_graph: Transition 37 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete
May 3 12:15:09 CSE-1 crmd[4898]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 3 12:15:09 CSE-1 pengine[4897]: notice: process_pe_message: Calculated Transition 37: /var/lib/pacemaker/pengine/pe-input-1.bz2
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org