I modified the RA to log each action call it performs, and this log shows that the monitor action is no longer being called.
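
For reference, the instrumentation is just one line near the top of the RA, before the action dispatch. A minimal sketch (the log file path and exact wording are illustrative; the meta variables are the standard OCF notify environment):

echo "$(date) ${OCF_RESOURCE_INSTANCE%%:*}: operation $1, type ${OCF_RESKEY_CRM_meta_notify_type}, operation ${OCF_RESKEY_CRM_meta_notify_operation}" >> /var/log/resABC-ra.log

This produces lines of the form shown in the RA output further below.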
 
From the logs I do not think it is the policy engine; it might be the LRM part of crmd (that is the only relevant change to be seen in a git diff between 1.1.10-rc7 and 1.1.10).
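
For anyone who wants to check that diff, it can be narrowed down to the LRM-related code along these lines (tag names as in the ClusterLabs git repository; the file list is my guess at the relevant area):

# git diff Pacemaker-1.1.10-rc7 Pacemaker-1.1.10 -- crmd/lrm.c lrmd/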
 
Explanation of the log below. Note that the configuration defines two recurring monitors: a 30s one for the slave (the default role) and a 20s one restricted to role="Master"; it is the 20s master monitor that disappears.
primitive resABC ocf:heartbeat:Stateful \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="30s" timeout="60s" on-fail="restart" \
        op promote interval="0s" timeout="60s" on-fail="restart" \
        op demote interval="0" timeout="60s" on-fail="restart" \
        op stop interval="0" timeout="60s" on-fail="restart" \
        op monitor interval="20" role="Master" timeout="60"
ms msABC resABC \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
crm_mon at the beginning of the log:
Last updated: Wed Jul 31 08:30:57 2013
Last change: Tue Jul 30 13:01:36 2013 via crmd on int2node1
Stack: corosync
Current DC: int2node1 (1743917066) - partition with quorum
Version: 1.1.10-1.el6-368c726
2 Nodes configured
5 Resources configured
Online: [ int2node1 int2node2 ]
 Master/Slave Set: msABC [resABC]
     Masters: [ int2node1 ]
     Slaves: [ int2node2 ]
crm_mon at the end of the log:
Last updated: Wed Jul 31 08:55:29 2013
Last change: Tue Jul 30 13:01:36 2013 via crmd on int2node1
Stack: corosync
Current DC: int2node1 (1743917066) - partition with quorum
Version: 1.1.10-1.el6-368c726
2 Nodes configured
5 Resources configured
Online: [ int2node1 ]
OFFLINE: [ int2node2 ]
Master/Slave Set: msABC [resABC]
     Masters: [ int2node1 ]
 
int2node1 is running, int2node2 has just been started:
2013-07-31T08:30:52.631+02:00 int2node1 pengine[16443] notice:   notice: LogActions: Start   resABC:1   (int2node2)
2013-07-31T08:30:52.638+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 9: monitor resABC:1_monitor_0 on int2node2
2013-07-31T08:30:52.638+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 54: notify resABC_pre_notify_start_0 on int2node1 (local)
2013-07-31T08:30:52.681+02:00 int2node1 crmd[16444] notice:   notice: process_lrm_event: LRM operation resABC_notify_0 (call=64, rc=0, cib-update=0, confirmed=true) ok
2013-07-31T08:30:52.780+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 25: start resABC:1_start_0 on int2node2
2013-07-31T08:30:52.940+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 55: notify resABC_post_notify_start_0 on int2node1 (local)
2013-07-31T08:30:52.943+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 56: notify resABC:1_post_notify_start_0 on int2node2
2013-07-31T08:30:52.982+02:00 int2node1 crmd[16444] notice:   notice: process_lrm_event: LRM operation resABC_notify_0 (call=67, rc=0, cib-update=0, confirmed=true) ok
2013-07-31T08:30:52.992+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 24: monitor resABC_monitor_20000 on int2node1 (local)
2013-07-31T08:30:52.996+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 26: monitor resABC:1_monitor_30000 on int2node2
2013-07-31T08:30:53.035+02:00 int2node1 crmd[16444] notice:   notice: process_lrm_event: LRM operation resABC_monitor_20000 (call=70, rc=8, cib-update=149, confirmed=false) master
 
At this point int2node2 is shut down:
2013-07-31T08:37:51.457+02:00 int2node1 crmd[16444] notice:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2013-07-31T08:37:51.462+02:00 int2node1 pengine[16443] notice:   notice: unpack_config: On loss of CCM Quorum: Ignore
2013-07-31T08:37:51.465+02:00 int2node1 pengine[16443] notice:   notice: stage6: Scheduling Node int2node2 for shutdown
2013-07-31T08:37:51.466+02:00 int2node1 pengine[16443] notice:   notice: LogActions: Stop    resABC:1   (int2node2)
2013-07-31T08:37:51.469+02:00 int2node1 pengine[16443] notice:   notice: process_pe_message: Calculated Transition 86: /var/lib/pacemaker/pengine/pe-input-125.bz2
2013-07-31T08:37:51.471+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 56: notify resABC_pre_notify_stop_0 on int2node1 (local)
2013-07-31T08:37:51.474+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 58: notify resABC_pre_notify_stop_0 on int2node2
2013-07-31T08:37:51.512+02:00 int2node1 crmd[16444] notice:   notice: process_lrm_event: LRM operation resABC_notify_0 (call=74, rc=0, cib-update=0, confirmed=true) ok
2013-07-31T08:37:51.514+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 23: stop resABC_stop_0 on int2node2
2013-07-31T08:37:51.654+02:00 int2node1 crmd[16444] notice:   notice: te_rsc_command: Initiating action 57: notify resABC_post_notify_stop_0 on int2node1 (local)
2013-07-31T08:37:51.699+02:00 int2node1 crmd[16444] notice:   notice: process_lrm_event: LRM operation resABC_notify_0 (call=78, rc=0, cib-update=0, confirmed=true) ok
2013-07-31T08:37:51.699+02:00 int2node1 crmd[16444] notice:   notice: run_graph: Transition 86 (Complete=13, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-125.bz2): Stopped
2013-07-31T08:37:51.705+02:00 int2node1 pengine[16443] notice:   notice: unpack_config: On loss of CCM Quorum: Ignore
2013-07-31T08:37:51.705+02:00 int2node1 pengine[16443] notice:   notice: stage6: Scheduling Node int2node2 for shutdown
2013-07-31T08:37:51.706+02:00 int2node1 pengine[16443] notice:   notice: process_pe_message: Calculated Transition 87: /var/lib/pacemaker/pengine/pe-input-126.bz2
2013-07-31T08:37:51.707+02:00 int2node1 crmd[16444] notice:   notice: run_graph: Transition 87 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-126.bz2): Complete
2013-07-31T08:37:51.707+02:00 int2node1 crmd[16444] notice:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2013-07-31T08:37:51.720+02:00 int2node1 crmd[16444] notice:   notice: peer_update_callback: do_shutdown of int2node2 (op 45) is complete
 
Output from RA on int2node1:
Wed Jul 31 08:30:52 CEST 2013 resABC: operation notify, type pre, operation start
Wed Jul 31 08:30:52 CEST 2013 resABC: operation notify, type post, operation start
Wed Jul 31 08:30:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:31:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:31:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:31:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:32:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:32:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:32:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:33:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:33:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:33:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:34:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:34:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:34:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:35:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:35:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:35:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:36:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:36:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:36:53 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:37:13 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:37:33 CEST 2013 resABC: operation monitor, type , operation
Wed Jul 31 08:37:51 CEST 2013 resABC: operation notify, type pre, operation stop
Wed Jul 31 08:37:51 CEST 2013 resABC: operation notify, type post, operation stop
 
After 08:37:51 there is no further log output from Pacemaker for resABC, nor any output from the RA on int2node1.
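
The missing recurring monitor can also be checked the way Takatoshi does below, by replaying the transition inputs named in the log above, e.g.:

# crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-125.bz2 | grep "Resource action"

If the bug is present, the master's recurring monitor (resABC_monitor_20000) should no longer show up once the slave is stopped.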
 
Sent: Wednesday, 31 July 2013 at 02:10
From: "Andrew Beekhof" <and...@beekhof.net>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] Announce: Pacemaker 1.1.10 now available

On 30/07/2013, at 9:13 PM, Rainer Brestan <rainer.bres...@gmx.net> wrote:

> I can agree, the Master monitor operation is broken in the 1.1.10 release.
> When the slave monitor action is started, the master monitor action is not called any more.

Based on?

>
> I have created a setup with a Stateful resource on two nodes.
> Then the Pacemaker installation was changed to different versions without changing the configuration part of the CIB.
>
> Result:
> 1.1.10-rc5, 1.1.10-rc6 and 1.1.10-rc7 do not have this error
> the 1.1.10-1 release has the error
>
> Installation order (just so anybody knows how it was done):
> 1.1.10-1 -> error
> 1.1.10-rc5 -> no error
> 1.1.10-rc6 -> no error
> 1.1.10-rc7 -> no error
> 1.1.10-1 -> error
>
> Rainer
> Sent: Friday, 26 July 2013 at 09:32
> From: "Takatoshi MATSUO" <matsuo....@gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Announce: Pacemaker 1.1.10 now available
> Hi
>
> I used the Stateful RA and hit the same issue.
>
> 1. before starting slave
>
> # crm_simulate -VVV -S -x /var/lib/pacemaker/pengine/pe-input-1543.bz2 | grep "Resource action"
> * Resource action: stateful monitor=2000 on 16-sl6
>
> 2. starting slave
> # crm_simulate -VVV -S -x /var/lib/pacemaker/pengine/pe-input-1544.bz2 | grep "Resource action"
> * Resource action: stateful monitor on 17-sl6
> * Resource action: stateful notify on 16-sl6
> * Resource action: stateful start on 17-sl6
> * Resource action: stateful notify on 16-sl6
> * Resource action: stateful notify on 17-sl6
> * Resource action: stateful monitor=3000 on 17-sl6
>
> 3. after
> # crm_simulate -VVV -S -x /var/lib/pacemaker/pengine/pe-input-1545.bz2 | grep "Resource action"
> * Resource action: stateful monitor=3000 on 17-sl6
>
> Monitor=2000 is deleted.
> Is this correct?
>
>
> My setting
> --------
> property \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false"
>
> rsc_defaults \
>         resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
> ms msStateful stateful \
>         meta \
>         master-max="1" \
>         master-node-max="1" \
>         clone-max="2" \
>         clone-node-max="1" \
>         notify="true"
>
> primitive stateful ocf:heartbeat:Stateful \
>         op start timeout="60s" interval="0s" on-fail="restart" \
>         op monitor timeout="60s" interval="3s" on-fail="restart" \
>         op monitor timeout="60s" interval="2s" on-fail="restart" role="Master" \
>         op promote timeout="60s" interval="0s" on-fail="restart" \
>         op demote timeout="60s" interval="0s" on-fail="stop" \
>         op stop timeout="60s" interval="0s" on-fail="block"
> --------
>
> Regards,
> Takatoshi MATSUO
>
> 2013/7/26 Takatoshi MATSUO <matsuo....@gmail.com>:
> > Hi
> >
> > My report is late for 1.1.10 :(
> >
> > I am using pacemaker 1.1.10-0.1.ab2e209.git.
> > It seems that the master's monitor is stopped when the slave is started.
> >
> > Has anyone encountered the same problem?
> > I attach a log and settings.
> >
> >
> > Thanks,
> > Takatoshi MATSUO
> >
> > 2013/7/26 Digimer <li...@alteeve.ca>:
> >> Congrats!! I know this was a long time in the making.
> >>
> >> digimer
> >>
> >>
> >> On 25/07/13 20:43, Andrew Beekhof wrote:
> >>>
> >>> Announcing the release of Pacemaker 1.1.10
> >>>
> >>> https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.10
> >>>
> >>> There were three changes of note since rc7:
> >>>
> >>> + Bug cl#5161 - crmd: Prevent memory leak in operation cache
> >>> + cib: Correctly read back archived configurations if the primary is
> >>> corrupted
> >>> + cman: Do not pretend we know the state of nodes we've never seen
> >>>
> >>> Along with assorted bug fixes, the major topics for this release were:
> >>>
> >>> - stonithd fixes
> >>> - fixing memory leaks, often caused by incorrect use of glib reference
> >>> counting
> >>> - supportability improvements (code cleanup and deduplication,
> >>> standardized error codes)
> >>>
> >>> Release candidates for the next Pacemaker release (1.1.11) can be
> >>> expected some time around November.
> >>>
> >>> A big thank you to everyone who spent time testing the release
> >>> candidates and/or contributed patches. However, now that Pacemaker is
> >>> perfect, anyone reporting bugs will be shot :-)
> >>>
> >>> To build `rpm` packages:
> >>>
> >>> 1. Clone the current sources:
> >>>
> >>> # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
> >>> # cd pacemaker
> >>>
> >>> 2. Install dependencies (if you haven't already)
> >>>
> >>> [Fedora] # sudo yum install -y yum-utils
> >>> [ALL] # make rpm-dep
> >>>
> >>> 3. Build Pacemaker
> >>>
> >>> # make release
> >>>
> >>> 4. Copy and deploy as needed
> >>>
> >>> ## Details - 1.1.10 - final
> >>>
> >>> Changesets: 602
> >>> Diff: 143 files changed, 8162 insertions(+), 5159 deletions(-)
> >>>
> >>> ## Highlights
> >>>
> >>> ### Features added since Pacemaker-1.1.9
> >>>
> >>> + Core: Convert all exit codes to positive errno values
> >>> + crm_error: Add the ability to list and print error symbols
> >>> + crm_resource: Allow individual resources to be reprobed
> >>> + crm_resource: Allow options to be set recursively
> >>> + crm_resource: Implement --ban for moving resources away from nodes
> >>> and --clear (replaces --unmove)
> >>> + crm_resource: Support OCF tracing when using
> >>> --force-(check|start|stop)
> >>> + PE: Allow active nodes in our current membership to be fenced without
> >>> quorum
> >>> + PE: Suppress meaningless IDs when displaying anonymous clone status
> >>> + Turn off auto-respawning of systemd services when the cluster starts
> >>> them
> >>> + Bug cl#5128 - pengine: Support maintenance mode for a single node
> >>>
> >>> ### Changes since Pacemaker-1.1.9
> >>>
> >>> + crmd: cib: stonithd: Memory leaks resolved and improved use of glib
> >>> reference counting
> >>> + attrd: Fixes deleted attributes during dc election
> >>> + Bug cl#5153 - Correctly display clone failcounts in crm_mon
> >>> + Bug cl#5133 - pengine: Correctly observe on-fail=block for failed
> >>> demote operation
> >>> + Bug cl#5148 - legacy: Correctly remove a node that used to have a
> >>> different nodeid
> >>> + Bug cl#5151 - Ensure node names are consistently compared without
> >>> case
> >>> + Bug cl#5152 - crmd: Correctly clean up fenced nodes during membership
> >>> changes
> >>> + Bug cl#5154 - Do not expire failures when on-fail=block is present
> >>> + Bug cl#5155 - pengine: Block the stop of resources if any depending
> >>> resource is unmanaged
> >>> + Bug cl#5157 - Allow migration in the absence of some colocation
> >>> constraints
> >>> + Bug cl#5161 - crmd: Prevent memory leak in operation cache
> >>> + Bug cl#5164 - crmd: Fixes crash when using pacemaker-remote
> >>> + Bug cl#5164 - pengine: Fixes segfault when calculating transition
> >>> with remote-nodes.
> >>> + Bug cl#5167 - crm_mon: Only print "stopped" node list for incomplete
> >>> clone sets
> >>> + Bug cl#5168 - Prevent clones from being bounced around the cluster
> >>> due to location constraints
> >>> + Bug cl#5170 - Correctly support on-fail=block for clones
> >>> + cib: Correctly read back archived configurations if the primary is
> >>> corrupted
> >>> + cib: The result is not valid when diffs fail to apply cleanly for CLI
> >>> tools
> >>> + cib: Restore the ability to embed comments in the configuration
> >>> + cluster: Detect and warn about node names with capitals
> >>> + cman: Do not pretend we know the state of nodes we've never seen
> >>> + cman: Do not unconditionally start cman if it is already running
> >>> + cman: Support non-blocking CPG calls
> >>> + Core: Ensure the blackbox is saved on abnormal program termination
> >>> + corosync: Detect the loss of members for which we only know the
> >>> nodeid
> >>> + corosync: Do not pretend we know the state of nodes we've never seen
> >>> + corosync: Ensure removed peers are erased from all caches
> >>> + corosync: Nodes that can persist in sending CPG messages must be
> >>> alive after all
> >>> + crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn't fence
> >>> returns
> >>> + crmd: Do not update fail-count and last-failure for old failures
> >>> + crmd: Ensure all membership operations can complete while trying to
> >>> cancel a transition
> >>> + crmd: Ensure operations for cleaned up resources don't block recovery
> >>> + crmd: Ensure we return to a stable state if there have been too many
> >>> fencing failures
> >>> + crmd: Initiate node shutdown if another node claims to have
> >>> successfully fenced us
> >>> + crmd: Prevent messages for remote crmd clients from being relayed to
> >>> wrong daemons
> >>> + crmd: Properly handle recurring monitor operations for remote-node
> >>> agent
> >>> + crmd: Store last-run and last-rc-change for all operations
> >>> + crm_mon: Ensure stale pid files are updated when a new process is
> >>> started
> >>> + crm_report: Correctly collect logs when 'uname -n' reports fully
> >>> qualified names
> >>> + fencing: Fail the operation once all peers have been exhausted
> >>> + fencing: Restore the ability to manually confirm that fencing
> >>> completed
> >>> + ipc: Allow unprivileged clients to clean up after server failures
> >>> + ipc: Restore the ability for members of the haclient group to connect
> >>> to the cluster
> >>> + legacy: Support "crm_node --remove" with a node name for corosync
> >>> plugin (bnc#805278)
> >>> + lrmd: Default to the upstream location for resource agent scratch
> >>> directory
> >>> + lrmd: Pass errors from lsb metadata generation back to the caller
> >>> + pengine: Correctly handle resources that recover before we operate on
> >>> them
> >>> + pengine: Delete the old resource state on every node whenever the
> >>> resource type is changed
> >>> + pengine: Detect constraints with inappropriate actions (i.e. promote
> >>> for a clone)
> >>> + pengine: Ensure per-node resource parameters are used during probes
> >>> + pengine: If fencing is unavailable or disabled, block further
> >>> recovery for resources that fail to stop
> >>> + pengine: Implement the rest of get_timet_now() and rename to
> >>> get_effective_time
> >>> + pengine: Re-initiate _active_ recurring monitors that previously
> >>> failed but have timed out
> >>> + remote: Workaround for inconsistent tls handshake behavior between
> >>> gnutls versions
> >>> + systemd: Ensure we get shut down correctly by systemd
> >>> + systemd: Reload systemd after adding/removing override files for
> >>> cluster services
> >>> + xml: Check for and replace non-printing characters with their octal
> >>> equivalent while exporting xml text
> >>> + xml: Prevent lockups by setting a more reliable buffer allocation
> >>> strategy
> >>>
> >>>
> >>
> >>
> >> --
> >> Digimer
> >> Papers and Projects: https://alteeve.ca/w/
> >> What if the cure for cancer is trapped in the mind of a person without
> >> access to education?
> >>
> >>


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org