On Thu, Oct 07, 2010 at 09:49:05AM +0200, Andrew Beekhof wrote:
> On Tue, Oct 5, 2010 at 1:50 PM, Dejan Muhamedagic <[email protected]> wrote:
> > Hi,
> >
> > On Tue, Oct 05, 2010 at 11:18:37AM +0200, Andrew Beekhof wrote:
> >> Dejan: looks like something in the lrm library.
> >> Any idea why the message doesn't contain lrm_opstatus?
> >
> > Because this monitor operation never ran. Which seems to be a
> > plausible explanation, since the start-delay is set to 600s.
>
> Isn't that what LRM_OP_PENDING is for?
> I'm happy to see that at least msg_to_op() maps missing fields to that value :-)
Actually it does; it's just that the library code logs the warning and
then the whole message. The missing op_status is then set to
LRM_OP_PENDING.

BTW, using start-delay means that there's a deficiency in the RA. That
attribute should be banned.

Thanks,

Dejan

> Looks like the crmd needs some extra logic for these deferred actions.
> Can you put the full logs somewhere? I'm missing the context of what
> the crmd is doing when it gets these messages.
>
> >> lrm_targetrc also looks strange.
> >
> > That translates to CHANGED, i.e. notify the requester only on a
> > status change. That should be what the requester itself sent with
> > the operation create request.
> >
> > Thanks,
> >
> > Dejan
> >
> >> On Thu, Sep 30, 2010 at 9:41 PM, Ron Kerry <[email protected]> wrote:
> >> > Folks -
> >> >
> >> > I am seeing the following message sequence that results in a "bogus"
> >> > declaration of monitor failures for two resources very quickly after a
> >> > failover completes (from hendrix to genesis) with all resources coming up.
> >> > The scenario is the same for both resources.
> >> >
> >> > The CXFS resource monitor is invoked after a successful start, but the
> >> > response is "faked", likely due to the start-delay defined for monitoring.
> >> >
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: te_rsc_command: Initiating
> >> > action 8: monitor CXFS_monitor_30000 on genesis (local)
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
> >> > key=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=CXFS_monitor_30000 )
> >> > Sep 30 10:23:33 genesis lrmd: [12173]: debug: on_msg_perform_op: an
> >> > operation operation monitor[15] on ocf::cxfs::CXFS for client 12176, its
> >> > parameters: CRM_meta_name=[monitor] CRM_meta_start_delay=[600000]
> >> > crm_feature_set=[3.0.2] CRM_meta_on_fail=[restar..
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
> >> > confirmation of CXFS_monitor_30000: execution postponed for over 5 minutes
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
> >> > resource op CXFS_monitor_30000 from
> >> > 8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860213-14
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: process_te_message: Processing
> >> > (N)ACK lrm_invoke-lrmd-1285860213-14 from genesis
> >> > Sep 30 10:23:33 genesis crmd: [12176]: info: match_graph_event: Action
> >> > CXFS_monitor_30000 (8) confirmed on genesis (rc=0)
> >> >
> >> > Similar sequence for the TMF resource ...
> >> >
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: te_rsc_command: Initiating
> >> > action 12: monitor TMF_monitor_60000 on genesis (local)
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
> >> > key=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=TMF_monitor_60000 )
> >> > Sep 30 10:23:44 genesis lrmd: [12173]: debug: on_msg_perform_op: an
> >> > operation operation monitor[19] on ocf::tmf::TMF for client 12176, its
> >> > parameters: admin_emails=[[email protected]] loader_hosts=[ibm3494cps]
> >> > devgrpnames=[ibm3592] loader_names=[ibm3494] loa...
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
> >> > confirmation of TMF_monitor_60000: execution postponed for over 5 minutes
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
> >> > resource op TMF_monitor_60000 from
> >> > 12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860224-19
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: process_te_message: Processing
> >> > (N)ACK lrm_invoke-lrmd-1285860224-19 from genesis
> >> > Sep 30 10:23:44 genesis crmd: [12176]: info: match_graph_event: Action
> >> > TMF_monitor_60000 (12) confirmed on genesis (rc=0)
> >> >
> >> > The TMF monitor operation state then gets an error. Note that the
> >> > operation id matches the above invocation.
> >> >
> >> > Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of
> >> > rsc TMF is LRM_RSC_IDLE
> >> > Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to
> >> > get the value of field lrm_opstatus from a ha_msg
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=TMF]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=60000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] :
> >> > [lrm_userdata=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] :
> >> > [(2)lrm_param=0x6000000000081260(331 413)]
> >> >
> >> > Same for the CXFS monitor operation state ...
> >> >
> >> > Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of
> >> > rsc CXFS is LRM_RSC_IDLE
> >> > Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to
> >> > get the value of field lrm_opstatus from a ha_msg
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16 fields
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=CXFS]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=30000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] :
> >> > [lrm_userdata=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> >> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] :
> >> > [(2)lrm_param=0x6000000000030320(214 254)]
> >> >
> >> > In both cases, the lrm_opstatus field appears to be missing!!
> >> >
> >> > This results in updating the fail-count and last-failure values for
> >> > these resources.
> >> >
> >> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> >> > flush op to all hosts for: fail-count-TMF (<null>)
> >> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> >> > flush op to all hosts for: last-failure-TMF (<null>)
> >> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> >> > flush op to all hosts for: last-failure-CXFS (<null>)
> >> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> >> > flush op to all hosts for: fail-count-CXFS (<null>)
> >> >
> >> > This causes pengine to fail over the resources, thinking there was a
> >> > monitor failure when there was not.
> >> >
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource CXFS (Started genesis -> hendrix)
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource VirtualIP (Started genesis -> hendrix)
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource TMF (Started genesis -> hendrix)
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource DMF (Started genesis -> hendrix)
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource DMFMAN (Started genesis -> hendrix)
> >> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move
> >> > resource DMFSOAP (Started genesis -> hendrix)
> >> >
> >> > --
> >> >
> >> > Ron Kerry         [email protected]
> >> > Field Technical Support - SGI Federal

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
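[Editor's note: for readers following along, the start-delay configuration under discussion would look roughly like the sketch below, in crm shell syntax. The interval (30s, per CXFS_monitor_30000), start-delay (600s, per CRM_meta_start_delay=[600000]) and on-fail=restart values are taken from the logs above; the agent provider "sgi" and the timeout value are assumptions, since only the class (ocf) and type (cxfs) appear in the logs.]

```
# Hypothetical reconstruction of the CXFS resource from the log output.
# Provider "sgi" and timeout are assumed, not confirmed by the thread.
primitive CXFS ocf:sgi:cxfs \
        op monitor interval="30s" timeout="30s" \
           start-delay="600s" on-fail="restart"
```

As Dejan argues above, a start-delay this large usually papers over a resource agent whose start action returns before the resource is actually usable; the cleaner fix is to make the RA's start block until a subsequent monitor would succeed, rather than deferring monitoring by ten minutes.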
