I have created bug 2580. I was not sure so I filed it against "CRM-core".
I have a potentially crazy theory about a PROBE occurring while the agent is doing a START of the resource and the PROBE getting an error LRM_RSC_BUSY but I am not sure. The "msg_to_op() failed to get the value of field lrm_opstatus from a ha_msg" seems to occur every time after LRMD prints LRM_RSC_BUSY. I will keep digging. Thanks, Bob ________________________________ From: Dejan Muhamedagic <[email protected]> To: [email protected] Sent: Tue, April 5, 2011 7:36:36 AM Subject: Re: [Linux-HA] Fw: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg On Tue, Apr 05, 2011 at 06:47:03AM -0700, Bob Schatz wrote: > Thanks Dejan for your help. Comments below > > >> get the value of field lrm_opstatus from a ha_msg > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message > >> follows: > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with > >> 16 > > >> fields > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : > >> [lrm_rid=SSJ0000E02A2:0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : > [lrm_timeout=300000] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : > >> [lrm_copyparams=1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : > >> [lrm_t_rcchange=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : > >> [lrm_queue_time=0] > > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : > >> [lrm_targetrc=-1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : > >> [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : > >> [(2)lrm_param=0x64c230(938 1098)] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with > >> 27 > > >> fields > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : > >> [CRM_meta_clone=0] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : > >> [CRM_meta_notify_slave_resource= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : > >> [CRM_meta_notify_active_resource= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : > >> [CRM_meta_notify_demote_uname= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : > >> [CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : > >> [ssconf=/var/omneon/config/config.J0000E02A2] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : > >> [CRM_meta_master_node_max=1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : > >> [CRM_meta_notify_stop_resource= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : > >> [CRM_meta_notify_master_resource= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : > >> [CRM_meta_clone_node_max=1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : > >> [CRM_meta_clone_max=2] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : > >> [CRM_meta_notify=true] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : > >> [CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : > >> [CRM_meta_notify_stop_uname= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : > >> [crm_feature_set=3.0.1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : > >> [CRM_meta_notify_master_uname= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : > >> [CRM_meta_master_max=1] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : > >> [CRM_meta_globally_unique=false] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : > >> [CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : > >> [CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : > >> [CRM_meta_notify_active_uname= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : > >> [CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] : > >> [CRM_meta_notify_slave_uname= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] : > >> [CRM_meta_name=start] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] : > >> [ss_resource=SSJ0000E02A2] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] : > >> [CRM_meta_notify_demote_resource= ] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] : > >> [CRM_meta_timeout=300000] > >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15] > >> > >> This results in the resources being stopped even though I can see from the > >> logging that the agent START function returned $OCF_SUCCESS. (The agent > start > >> function prints "ss_start() START" and "ss_start() END" in the logging). > >> > >> The START function can take anywhere from 30 - 60 seconds to complete due > >> to > > >>our > >> > >> > >> > >> application. > >> > >> I am running with 1.0.9 Pacemaker and heartbeat 3.0.3. > >> > >> I have attached the configuration as a file to this email since I thought > >> it > > >> would make the email unreadable. (Summary is 6 master/slave resources). > >> > >> I have also attached logs . The above messages are from the file > n0-short.txt > >> but also occur in n1-short.txt. > >> > >> I thought that maybe I was running into a problem with the number of > >> threads > > >> that lrmd had configured. I increased in to 40 and proved that it was in > >> affect with: > >> > >> # /sbin/lrmadmin -g max-children > >> max-children: 40 > >> > >> This problem is reproducible every time. > > > >The missing lrm_opstatus field is due to the operation never > >being run hence no status to report. Perhaps this particular > >case should have severity reduced to info. > > > >Did you observe any adverse effects otherwise? > > > >Thanks, > > > >Dejan > > The agent START function was called as seen by the "ss_start() START" and > "ss_start() END" messages. The return value $OCF_SUCCESS appears to have > been > > lost somehow. Hmm, that's a serious problem, if it really happened. > The result is that the resource was stopped and I could not restart it even > after clearing the failure counts with "crm_resource -C". That's really strange, normally "crm resource cleanup" should always result in resource trying to start again. Best to open a bugzilla with hb_report attached. Thanks, Dejan > > Thanks, > > Bob > > > > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
