Hi list,

I am having trouble configuring a resource that should be allowed to fail once every two minutes. The documentation says that I have to configure migration-threshold and failure-timeout on the resource to achieve this.
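In case the exact command matters, I believe I set the two meta attributes with pcs roughly like this (quoting from memory, so the syntax may be slightly off):

# pcs resource meta resClamd failure-timeout=120s migration-threshold=2

Here is the resulting configuration for the resource: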
# pcs config
Cluster Name: mycluster
Corosync Nodes:
Pacemaker Nodes:
 Node1 Node2 Node3
Resources:
 Clone: resClamd-clone
  Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
  Resource: resClamd (class=lsb type=clamd)
   Meta Attrs: failure-timeout=120s migration-threshold=2
   Operations: monitor on-fail=restart interval=60s (resClamd-monitor-on-fail-restart)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.1-368c726
 last-lrm-refresh: 1390468150
 stonith-enabled: false

# pcs resource defaults
resource-stickiness: INFINITY

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:12:49 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:
 Clone Set: resClamd-clone [resClamd]
     Started: [ Node1 Node2 Node3 ]

Stopping the clamd daemon sets the failcount to 1 and the daemon is restarted, as expected:

# service clamd stop
Stopping Clam AntiVirus Daemon: [ OK ]

/var/log/messages:
Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (1)
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 177: fail-count-resClamd=1
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468520)
Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 179: last-failure-resClamd=1390468520
Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
Jan 23 10:15:21 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_stop_0 (call=310, rc=0, cib-update=110, confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111, confirmed=true) ok
Jan 23 10:15:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112, confirmed=false) ok

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:16:48 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:
 Clone Set: resClamd-clone [resClamd]
     Started: [ Node1 Node2 Node3 ]

Failed actions:
    resClamd_monitor_60000 on Node1 'not running' (7): call=305, status=complete, last-rc-change='Thu Jan 23 10:15:20 2014', queued=0ms, exec=0ms

# pcs resource failcount show resClamd
Failcounts for resClamd
 Node1: 1
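If I wanted to clear this failure by hand, I assume something like the following would do it (I did not run the reset subcommand during this test, so take it as a sketch):

# pcs resource failcount reset resClamd Node1

But my understanding is that failure-timeout=120s should make the cluster forget the failure on its own after two minutes, which is what I am after.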
After 7 minutes I let it fail again. As I understand it, the first failure should have expired by then (failure-timeout is 120s), so the resource should simply be restarted again. But it isn't:

# service clamd stop
Stopping Clam AntiVirus Daemon: [ OK ]

Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113, confirmed=false) not running
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (2)
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 181: fail-count-resClamd=2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468950)
Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 183: last-failure-resClamd=1390468950
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_stop_0 (call=322, rc=0, cib-update=114, confirmed=true) ok

# pcs status
Cluster name: mycluster
Last updated: Thu Jan 23 10:22:41 2014
Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
Stack: cman
Current DC: Node2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
3 Nodes configured
3 Resources configured

Online: [ Node1 Node2 Node3 ]

Full list of resources:
 Clone Set: resClamd-clone [resClamd]
     Started: [ Node2 Node3 ]
     Stopped: [ Node1 ]

Failed actions:
    resClamd_monitor_60000 on Node1 'not running' (7): call=317, status=complete, last-rc-change='Thu Jan 23 10:22:30 2014', queued=0ms, exec=0ms

The failcount went to 2, so migration-threshold was reached, which means the first failure was apparently never forgotten. What's wrong with my configuration?

Thanks in advance
Frank
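P.S. To get the stopped instance back on Node1 in the meantime, I assume a plain cleanup would clear the failures and let the clone start there again (again only a sketch, not something I ran in this test):

# pcs resource cleanup resClamd

But I would of course prefer failure-timeout to handle this automatically.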