Dejan: looks like something in the lrm library. Any idea why the message doesn't contain lrm_opstatus? lrm_targetrc also looks strange.
On Thu, Sep 30, 2010 at 9:41 PM, Ron Kerry <rke...@sgi.com> wrote:
> Folks -
>
> I am seeing the following message sequence that results in a "bogus"
> declaration of monitor failures for two resources very quickly after a
> failover completes (from hendrix to genesis) with all resources coming up.
> The scenario is the same for both resources.
>
> The CXFS resource monitor is invoked after a successful start, but the
> response is "faked", likely due to the start-delay defined for monitoring.
>
> Sep 30 10:23:33 genesis crmd: [12176]: info: te_rsc_command: Initiating action 8: monitor CXFS_monitor_30000 on genesis (local)
> Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing key=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=CXFS_monitor_30000 )
> Sep 30 10:23:33 genesis lrmd: [12173]: debug: on_msg_perform_op: an operation operation monitor[15] on ocf::cxfs::CXFS for client 12176, its parameters: CRM_meta_name=[monitor] CRM_meta_start_delay=[600000] crm_feature_set=[3.0.2] CRM_meta_on_fail=[restar..
> Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking confirmation of CXFS_monitor_30000: execution postponed for over 5 minutes
> Sep 30 10:23:33 genesis crmd: [12176]: info: send_direct_ack: ACK'ing resource op CXFS_monitor_30000 from 8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860213-14
> Sep 30 10:23:33 genesis crmd: [12176]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285860213-14 from genesis
> Sep 30 10:23:33 genesis crmd: [12176]: info: match_graph_event: Action CXFS_monitor_30000 (8) confirmed on genesis (rc=0)
>
> A similar sequence follows for the TMF resource ...
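For context, CRM_meta_start_delay=[600000] in the lrmd line above corresponds to a 10-minute start-delay on the 30-second CXFS monitor operation. A CIB fragment along these lines would produce it (a sketch only: the op id and the on-fail value are assumptions, since the actual CIB isn't shown and the on_fail parameter is truncated in the log):

```xml
<!-- Sketch of the operation definition implied by the log above:
     interval 30000 ms, start-delay 600000 ms (10 minutes).
     The id and on-fail="restart" are assumed, not taken from the real CIB. -->
<op id="CXFS-monitor-30000" name="monitor"
    interval="30000ms" start-delay="600000ms" on-fail="restart"/>
```

The crmd's "execution postponed for over 5 minutes" message is triggered precisely because this delay exceeds its threshold for waiting on a real result, so it fakes the confirmation instead.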
> Sep 30 10:23:44 genesis crmd: [12176]: info: te_rsc_command: Initiating action 12: monitor TMF_monitor_60000 on genesis (local)
> Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing key=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=TMF_monitor_60000 )
> Sep 30 10:23:44 genesis lrmd: [12173]: debug: on_msg_perform_op: an operation operation monitor[19] on ocf::tmf::TMF for client 12176, its parameters: admin_emails=[rke...@sgi.com] loader_hosts=[ibm3494cps] devgrpnames=[ibm3592] loader_names=[ibm3494] loa...
> Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking confirmation of TMF_monitor_60000: execution postponed for over 5 minutes
> Sep 30 10:23:44 genesis crmd: [12176]: info: send_direct_ack: ACK'ing resource op TMF_monitor_60000 from 12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860224-19
> Sep 30 10:23:44 genesis crmd: [12176]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285860224-19 from genesis
> Sep 30 10:23:44 genesis crmd: [12176]: info: match_graph_event: Action TMF_monitor_60000 (12) confirmed on genesis (rc=0)
>
> Then the TMF monitor operation state query gets an error. Note that the
> operation id matches the above invocation.
> Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc TMF is LRM_RSC_IDLE
> Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
> Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=TMF]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=60000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] : [lrm_userdata=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] : [(2)lrm_param=0x6000000000081260(331 413)]
>
> The same appears for the CXFS monitor operation state ...
> Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc CXFS is LRM_RSC_IDLE
> Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
> Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=CXFS]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=30000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] : [lrm_userdata=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] : [(2)lrm_param=0x6000000000030320(214 254)]
>
> In both cases, the lrm_opstatus field appears to be missing!!
>
> This results in updating the fail-count and last-failure values for these
> resources.
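To make the symptom concrete: modeling the dumped message as a plain field map (field names and values transcribed from the CXFS dump above; the dict and the membership check are only an illustration, not the actual ha_msg API or crmd code) shows that the lookup msg_to_op() performs has nothing to find:

```python
# Fields transcribed from the CXFS ha_msg dump above (MSG[0]..MSG[13]);
# the lrm_param blob is omitted. This is an illustration, not crmd code.
msg = {
    "lrm_t": "op",
    "lrm_rid": "CXFS",
    "lrm_op": "monitor",
    "lrm_timeout": 630000,
    "lrm_interval": 30000,
    "lrm_delay": 600000,
    "lrm_copyparams": 0,
    "lrm_t_run": 0,
    "lrm_t_rcchange": 0,
    "lrm_exec_time": 0,
    "lrm_queue_time": 0,
    "lrm_targetrc": -2,
    "lrm_app": "crmd",
    "lrm_userdata": "8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c",
}

# The WARN above fires because this field simply is not in the message:
print("lrm_opstatus" in msg)  # prints False
```

Note also that lrm_t_run=0 and lrm_exec_time=0 in both dumps: the operation apparently never actually executed, which would be consistent with the earlier "Faking confirmation" messages for these same operation ids.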
> Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-TMF (<null>)
> Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-TMF (<null>)
> Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-CXFS (<null>)
> Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-CXFS (<null>)
>
> This causes pengine to fail over the resources, thinking there was a monitor
> failure when there was not.
>
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource CXFS (Started genesis -> hendrix)
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource VirtualIP (Started genesis -> hendrix)
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource TMF (Started genesis -> hendrix)
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource DMF (Started genesis -> hendrix)
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource DMFMAN (Started genesis -> hendrix)
> Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource DMFSOAP (Started genesis -> hendrix)
>
> --
>
> Ron Kerry  rke...@sgi.com
> Field Technical Support - SGI Federal
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker