On 01/05/2013, at 2:51 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:

> Using 1.1.8 on EL6.4, I am seeing this sort of thing:
> 
> pengine[1590]:  warning: unpack_rsc_op: Processing failed op monitor for 
> my_resource on node1: unknown error (1)
> 
> The full log from the point of adding the resource until the errors:
> 
> Apr 30 11:46:30 node1 cibadmin[3380]:   notice: crm_log_args: Invoked: 
> cibadmin -o resources -C -x /tmp/tmpHrgNZv 
> Apr 30 11:46:30 node1 crmd[1591]:   notice: do_state_transition: State 
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.24.5
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.25.1 
> 8a4aac3dcddc2689e4b336e1bf2078ff
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- <cib admin_epoch="0" 
> epoch="24" num_updates="5" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++       <primitive 
> class="ocf" provider="my_provider" type="my_RA" id="my_resource" >
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         
> <meta_attributes id="my_resource-meta_attributes" >
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++           <nvpair 
> name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" 
> />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         
> </meta_attributes>
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         <operations 
> id="my_resource-operations" >
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++           <op 
> id="my_resource-monitor-5" interval="5" name="monitor" timeout="60" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++           <op 
> id="my_resource-start-0" interval="0" name="start" timeout="300" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++           <op 
> id="my_resource-stop-0" interval="0" name="stop" timeout="300" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         </operations>
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         
> <instance_attributes id="my_resource-instance_attributes" >
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++           <nvpair 
> id="my_resource-instance_attributes-my_RA" name="my_RA" 
> value="33bb17d2-350b-495f-bd8d-8427baabeed9" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++         
> </instance_attributes>
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++       </primitive>
> Apr 30 11:46:30 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Apr 30 11:46:30 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:46:30 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:46:30 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:46:30 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:46:30 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:46:30 node1 pengine[1590]:   notice: process_pe_message: Calculated 
> Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2
> Apr 30 11:46:30 node1 cibadmin[3386]:   notice: crm_log_args: Invoked: 
> cibadmin -o constraints -C -X <rsc_location id="my_resource-primary" 
> node="node1" rsc="my_resource" score="20"/> 
> Apr 30 11:46:30 node1 cib[1586]:   notice: log_cib_diff: cib:diff: Local-only 
> Change: 0.26.1
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: -- <cib admin_epoch="0" 
> epoch="25" num_updates="1" />
> Apr 30 11:46:30 node1 cib[1586]:   notice: cib:diff: ++       <rsc_location 
> id="my_resource-primary" node="node1" rsc="my_resource" score="20" />
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.26.3
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.27.1 
> 8305c8fe19d06a6204bd04f437eb923a
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: --         <nvpair 
> value="1367322378" id="cib-bootstrap-options-last-lrm-refresh" />
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++         <nvpair 
> id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" 
> value="1367322393" />
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.27.2
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.28.1 
> 0dbddb3084f7cd76bffe21916538be94
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: --           <nvpair 
> value="Stopped" id="my_resource-meta_attributes-my_RA-role" />
> Apr 30 11:46:33 node1 cib[1586]:   notice: cib:diff: ++           <nvpair 
> name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Started" 
> />
> Apr 30 11:46:33 node1 crmd[1591]:  warning: do_update_resource: Resource 
> my_resource no longer exists in the lrmd
> Apr 30 11:46:33 node1 crmd[1591]:   notice: process_lrm_event: LRM operation 
> my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not 
> running
> Apr 30 11:46:33 node1 crmd[1591]:  warning: decode_transition_key: Bad UUID 
> (crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c
> Apr 30 11:46:33 node1 crmd[1591]:    error: send_msg_via_ipc: Unknown 
> Sub-system (3397_crm_resource)... discarding message.
> Apr 30 11:47:50 node1 crmd[1591]:  warning: action_timer_callback: Timer 
> popped (timeout=20000, abort_level=1000000, complete=false)
> Apr 30 11:47:50 node1 crmd[1591]:    error: print_synapse: [Action    5]: 
> In-flight rsc op my_resource_monitor_0   on node1 (priority: 0, waiting: none)
> Apr 30 11:47:50 node1 crmd[1591]:  warning: cib_action_update: rsc_op 5: 
> my_resource_monitor_0 on node1 timed out
> Apr 30 11:47:50 node1 crmd[1591]:   notice: run_graph: Transition 5 
> (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped
> Apr 30 11:47:50 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Apr 30 11:47:50 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:50 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:50 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:50 node1 pengine[1590]:  warning: unpack_rsc_op: Processing 
> failed op monitor for my_resource on node1: unknown error (1)
> Apr 30 11:47:50 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:50 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:50 node1 pengine[1590]:   notice: LogActions: Recover 
> my_resource#011(Started node1)
> Apr 30 11:47:50 node1 pengine[1590]:   notice: process_pe_message: Calculated 
> Transition 6: /var/lib/pacemaker/pengine/pe-input-11.bz2
> Apr 30 11:47:52 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.28.5
> Apr 30 11:47:52 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.29.1 
> c79afeec7d8f0d54fb425821e4e910f9
> Apr 30 11:47:52 node1 cib[1586]:   notice: cib:diff: --           <nvpair 
> value="Started" id="my_resource-meta_attributes-my_RA-role" />
> Apr 30 11:47:52 node1 cib[1586]:   notice: cib:diff: ++           <nvpair 
> name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" 
> />
> Apr 30 11:47:53 node1 crmd[1591]:   notice: run_graph: Transition 6 
> (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-11.bz2): Stopped
> Apr 30 11:47:53 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:  warning: unpack_rsc_op: Processing 
> failed op monitor for my_resource on node1: unknown error (1)
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:47:53 node1 pengine[1590]:   notice: LogActions: Stop    
> my_resource#011(node1)
> Apr 30 11:47:53 node1 pengine[1590]:   notice: process_pe_message: Calculated 
> Transition 7: /var/lib/pacemaker/pengine/pe-input-12.bz2
> Apr 30 11:47:57 node1 crmd[1591]:   notice: process_lrm_event: LRM operation 
> my_resource_stop_0 (call=38, rc=0, cib-update=89, confirmed=true) ok
> Apr 30 11:47:57 node1 crmd[1591]:   notice: run_graph: Transition 7 
> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-12.bz2): Complete
> Apr 30 11:47:57 node1 crmd[1591]:   notice: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Apr 30 11:50:10 node1 crmd[1591]:   notice: do_state_transition: State 
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Apr 30 11:50:10 node1 cib[1586]:   notice: cib:diff: Diff: --- 0.29.3
> Apr 30 11:50:10 node1 cib[1586]:   notice: cib:diff: Diff: +++ 0.30.1 
> 628db64f8576694a86a58110e174731d
> Apr 30 11:50:10 node1 cib[1586]:   notice: cib:diff: --           <nvpair 
> value="Started" id="MGS_18e873-meta_attributes-my_RA-role" />
> Apr 30 11:50:10 node1 cib[1586]:   notice: cib:diff: ++           <nvpair 
> name="my_RA-role" id="MGS_18e873-meta_attributes-my_RA-role" value="Stopped" 
> />
> Apr 30 11:50:10 node1 pengine[1590]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Apr 30 11:50:10 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:50:10 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:50:10 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:50:10 node1 pengine[1590]:  warning: unpack_rsc_op: Processing 
> failed op monitor for my_resource on node1: unknown error (1)
> Apr 30 11:50:10 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> Apr 30 11:50:10 node1 pengine[1590]:     crit: get_timet_now: Defaulting to 
> 'now'
> 
> I'm wondering what:
> 
> warning: unpack_rsc_op: Processing failed op monitor for my_resource on 
> node1: unknown error (1)
> 
> really means.  Is that that the RA ran (i.e. did it's job) but didn't
> return something pacemaker was expecting

Normally it means the RA returned 1 (aka. OCF_ERR_GENERIC):
   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-ocf-return-codes.html

> or was there a problem even
> trying to run the RA.

It might be possible, but I'd expect some more errors in that case.

> In the case of this RA the monitor operation only takes a few seconds at
> most and as you can see, I give it 60 seconds so it's difficult to
> understand how this:
> 
> Apr 30 11:47:50 node1 crmd[1591]:  warning: cib_action_update: rsc_op 5: 
> my_resource_monitor_0 on node1 timed out
> 
> can happen.

Could it be blocked on I/O or similar?
I've seen even IPaddr take ridiculously long on a sufficiently overloaded 
system.

> 
> Additionally, I have my RA rigged to log

Can we see it?

> and provide debugging for
> every invocation and I see no evidence of the RA being run at that
> time.
> 
> Any thoughts?
> 
> b.
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Reply via email to