Re: [Linux-HA] Fw: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

Dejan Muhamedagic Tue, 05 Apr 2011 07:37:03 -0700

On Tue, Apr 05, 2011 at 06:47:03AM -0700, Bob Schatz wrote:
> Thanks Dejan for your help.  Comments below
> 
> >> get the value of field lrm_opstatus from a ha_msg 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message 
> >> follows: 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 
> >> 16 
> >> fields 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : 
> >> [lrm_rid=SSJ0000E02A2:0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : 
> [lrm_timeout=300000] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : 
> >> [lrm_copyparams=1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : 
> >> [lrm_t_rcchange=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : 
> >> [lrm_queue_time=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : 
> >> [lrm_targetrc=-1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : 
> >> [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : 
> >> [(2)lrm_param=0x64c230(938 1098)] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 
> >> 27 
> >> fields 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : 
> >> [CRM_meta_clone=0] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : 
> >> [CRM_meta_notify_slave_resource= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : 
> >> [CRM_meta_notify_active_resource= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : 
> >> [CRM_meta_notify_demote_uname= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : 
> >> [CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : 
> >> [ssconf=/var/omneon/config/config.J0000E02A2] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : 
> >> [CRM_meta_master_node_max=1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : 
> >> [CRM_meta_notify_stop_resource= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : 
> >> [CRM_meta_notify_master_resource= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : 
> >> [CRM_meta_clone_node_max=1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : 
> >> [CRM_meta_clone_max=2] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : 
> >> [CRM_meta_notify=true] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : 
> >> [CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : 
> >> [CRM_meta_notify_stop_uname= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : 
> >> [crm_feature_set=3.0.1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : 
> >> [CRM_meta_notify_master_uname= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : 
> >> [CRM_meta_master_max=1] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : 
> >> [CRM_meta_globally_unique=false] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : 
> >> [CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : 
> >> [CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : 
> >> [CRM_meta_notify_active_uname= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : 
> >> [CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] : 
> >> [CRM_meta_notify_slave_uname= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] : 
> >> [CRM_meta_name=start] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] : 
> >> [ss_resource=SSJ0000E02A2] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] : 
> >> [CRM_meta_notify_demote_resource= ] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] : 
> >> [CRM_meta_timeout=300000] 
> >> 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15] 
> >> 
> >> This results in the resources being stopped even though I can see from the 
> >> logging that the agent START function returned $OCF_SUCCESS. (The agent 
> start 
> >> function prints "ss_start() START" and "ss_start() END" in the logging). 
> >> 
> >> The START function can take anywhere from 30 - 60 seconds to complete due 
> >> to 
> >>our 
> >> 
> >> 
> >> 
> >> application. 
> >> 
> >> I am running with 1.0.9 Pacemaker and heartbeat 3.0.3. 
> >> 
> >> I have attached the configuration as a file to this email since I thought 
> >> it 
> >> would make the email unreadable. (Summary is 6 master/slave resources). 
> >> 
> >> I have also attached logs . The above messages are from the file 
> n0-short.txt 
> >> but also occur in n1-short.txt. 
> >> 
> >> I thought that maybe I was running into a problem with the number of 
> >> threads 
> >> that lrmd had configured. I increased in to 40 and proved that it was in 
> >> affect with: 
> >> 
> >> # /sbin/lrmadmin -g max-children 
> >> max-children: 40 
> >> 
> >> This problem is reproducible every time. 
> >
> >The missing lrm_opstatus field is due to the operation never 
> >being run hence no status to report. Perhaps this particular 
> >case should have severity reduced to info. 
> >
> >Did you observe any adverse effects otherwise? 
> >
> >Thanks, 
> >
> >Dejan 
> 
> The agent START function was called as seen by the "ss_start() START" and 
> "ss_start() END" messages.   The return value $OCF_SUCCESS appears to have 
> been 
> lost somehow.


Hmm, that's a serious problem, if it really happened.

> The result is that the resource was stopped and I could not restart it even 
> after clearing the failure counts with "crm_resource -C".

That's really strange, normally "crm resource cleanup" should
always result in resource trying to start again.

Best to open a bugzilla with hb_report attached.

Thanks,

Dejan

> 
> Thanks,
> 
> Bob
> 
> 
>       
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Fw: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

Reply via email to