On Tue, Mar 29, 2011 at 05:14:39PM -0700, Bob Schatz wrote:
> I am forwarding this to the LinuxHA mailing list in case I emailed it to the 
> wrong list.

Sometimes it takes time ...

> A few more thoughts that occurred after I hit <return>
> 
> 1.  This problem seems to occur only when "/etc/init.d/heartbeat start" is
> executed on two nodes at the same time.  If I only do one at a time it does
> not seem to occur.  (This may be related to the creation of master/slave
> resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)
> 2.  This problem seemed to occur most frequently when I went from 4
> master/slave resources to 6 master/slave resources.

OK. Interesting.

> Thanks,
> 
> Bob
> 
> 
> ----- Original Message ----
> From: Bob Schatz <[email protected]>
> To: The Pacemaker cluster resource manager <[email protected]>
> Sent: Fri, March 25, 2011 4:22:39 PM
> Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of
> field lrm_opstatus from a ha_msg
> 
> After reading more threads, I noticed that I needed to include the PE outputs.
> 
> Therefore, I have rerun the tests and included the PE outputs, the
> configuration file and the logs for both nodes.
> 
> The test was rerun with max-children of 20.
> 
> Thanks,
> 
> Bob
> 
> 
> ----- Original Message ----
> From: Bob Schatz <[email protected]>
> To: [email protected]
> Sent: Thu, March 24, 2011 7:35:54 PM
> Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field 
> lrm_opstatus from a ha_msg
> 
> I am getting these messages in the log:
> 
>    2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
>    2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16 fields
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJ0000E02A2:0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=300000]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [(2)lrm_param=0x64c230(938 1098)]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27 fields
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [CRM_meta_notify_slave_resource= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [CRM_meta_notify_active_resource= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [CRM_meta_notify_demote_uname= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [ssconf=/var/omneon/config/config.J0000E02A2]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [CRM_meta_master_node_max=1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [CRM_meta_notify_stop_resource= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [CRM_meta_notify_master_resource= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [CRM_meta_clone_node_max=1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [CRM_meta_clone_max=2]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [CRM_meta_notify=true]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [CRM_meta_notify_stop_uname= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [crm_feature_set=3.0.1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [CRM_meta_notify_master_uname= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : [CRM_meta_master_max=1]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : [CRM_meta_globally_unique=false]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : [CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : [CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : [CRM_meta_notify_active_uname= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : [CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] : [CRM_meta_notify_slave_uname= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] : [CRM_meta_name=start]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] : [ss_resource=SSJ0000E02A2]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] : [CRM_meta_notify_demote_resource= ]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] : [CRM_meta_timeout=300000]
>    2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15]
> 
> This results in the resources being stopped even though I can see from the
> logging that the agent START function returned $OCF_SUCCESS.  (The agent
> start function prints "ss_start()  START" and "ss_start() END" in the logging).
> 
> The START function can take anywhere from 30 - 60 seconds to complete due to
> our application.
> 
> I am running with 1.0.9 Pacemaker and heartbeat 3.0.3.
> 
> I have attached the configuration as a file to this email since I thought it 
> would make the email unreadable.  (Summary is 6 master/slave resources).
> 
> I have also attached logs.  The above messages are from the file
> n0-short.txt but also occur in n1-short.txt.
> 
> I thought that maybe I was running into a problem with the number of threads
> that lrmd had configured.  I increased it to 40 and confirmed that it was in
> effect with:
> 
>    # /sbin/lrmadmin -g max-children
>    max-children: 40
> 
> This problem is reproducible every time.

The missing lrm_opstatus field is due to the operation never
being run, hence there is no status to report. Perhaps this
particular case should have its severity reduced to info.
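
To see quickly which operations are affected, something along these
lines may help. It is only a sketch: it assumes each MSG entry sits on
a single line (rejoin wrapped lines first) in the "MSG[n] : [key=value]"
form quoted above, and the function names parse_dump and summarize are
made up for illustration, not part of any lrm or crmd tool.

```python
import re

# Matches "MSG[n] : [key=value]" as seen in the quoted crmd log.
# This pattern is an assumption based only on the lines in this thread.
FIELD_RE = re.compile(r"MSG\[\d+\] : \[(?P<key>[^=\]]+)=(?P<value>[^\]]*)\]")


def parse_dump(lines):
    """Collect key/value pairs from the lines of one message dump."""
    fields = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields


def summarize(lines):
    """Report the resource, the operation, and whether a status is present."""
    fields = parse_dump(lines)
    return {
        "resource": fields.get("lrm_rid"),
        "operation": fields.get("lrm_op"),
        "has_opstatus": "lrm_opstatus" in fields,
    }


# A few lines taken from the log quoted in this thread.
sample = [
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]",
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJ0000E02A2:0]",
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]",
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15]",
]

print(summarize(sample))
```

If has_opstatus is False for a dump, that matches the case described
here: the operation has no status to report.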

Did you observe any adverse effects otherwise?

Thanks,

Dejan

> Thanks in advance,
> 
> Bob
> 
> 
>       
> 
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 
>       
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
