[Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I have two clusters that are both running CentOS 5.6 and
heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running
slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the
first one and pacemaker-1.0.12-1.el5 on the other). They both have
identical ha.cf files except that the bcast device names are different
(and they are correct for each case, I checked), like this:

udpport 694
bcast eth2
bcast eth1
use_logd off
logfile /var/log/halog
debugfile /var/log/hadebug
debug 1
keepalive 2
deadtime 15
initdead 60
node vmd1.ucar.edu
node vmd2.ucar.edu
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn

On one of them (which may or may not coincidentally be having some
problems), I get these messages logged about every 2 seconds
in /var/log/halog; on the other I don't see them:

May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping
message with 10 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] :
[t=NS_ackmsg]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] :
[dest=vmx2.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] :
[ackseq=3a0]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] :
[(1)destuuid=0x5ceb280(37 28)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [auth=1
23b556bcb61a08abecf87cb6411c62e62cf99f0d]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping
message with 12 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] :
[t=status]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] :
[st=active]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] :
[dt=3a98]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] :
[protocol=1]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[seq=17b]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] :
[ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] :
[ld=0.27 0.41 0.26 1/315 19183]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] :
[ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] :
[auth=1 3d3da4df831636f7c274395041ffb49bbf215170]

The questions are what do these messages actually mean, why is one
cluster logging them and not the other, and is this something I should
be worried about?
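
Some of the dumped fields can be decoded by hand. The sketch below is an interpretation rather than anything taken from the heartbeat documentation: it assumes "ts" is a hexadecimal Unix timestamp and "dt" is the deadtime in hexadecimal milliseconds. Both assumptions fit the values above (the log lines carry local time, the decoded value is UTC, and "deadtime 15" appears in ha.cf). The helper names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches one dumped field such as "[ts=51a13435]" or "[(1)srcuuid=0x5ceb390(36 27)]".
FIELD_RE = re.compile(r"\[(?:\(\d+\))?(\w+)=([^\]]*)\]")


def parse_fields(dump):
    """Collect key/value pairs from the bracketed fields of a message dump."""
    return {key: value for key, value in FIELD_RE.findall(dump)}


def decode(fields):
    """Decode hex fields, ASSUMING ts is Unix time and dt is milliseconds."""
    decoded = {}
    if "ts" in fields:
        seconds = int(fields["ts"], 16)
        decoded["ts"] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    if "dt" in fields:
        decoded["dt_seconds"] = int(fields["dt"], 16) / 1000
    return decoded


# Fields copied from the 12-field status message in the log above.
sample = "[t=status] [st=active] [dt=3a98] [src=vmx1.ucar.edu] [ts=51a13435] [ttl=3]"
result = decode(parse_fields(sample))
print(result["ts"].isoformat())  # 2013-05-25T21:59:17+00:00
print(result["dt_seconds"])      # 15.0
```

Read this way, the messages look like ordinary status and ack packets that were dumped because they could not be written, not like corrupted data.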

Thanks for any info,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I know it's tacky to reply to myself, but I can answer one of my
questions after another 15 minutes or so of poring through logs:

On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:

 
> The questions are what do these messages actually mean, why is one
> cluster logging them and not the other, and is this something I should
> be worried about?

The answer to the last one is that this is definitely a problem, because
after nearly half an hour, this is logged:

May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[seq=3a4]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[hg=4c97c17a]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] :
[ts=51a13888]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] :
[ld=0.50 0.33 0.28 3/316 13859]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] :
[ttl=3]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] :
[auth=1 feb94da356847a538290ea75f27423c996c0a595]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child:
Exiting due to persistent errors: No such device
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBWRITE
process 5689 exited with return code 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: HBWRITE process
died.  Beginning communications restart process for comm channel 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBREAD
process 5690 killed by signal 9 [SIGKILL - Kill, unblockable].
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: Both comm
processes for channel 1 have died.  Restarting.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth4
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: Communications
restart succeeded.
May 25 16:17:45 vmx1.ucar.edu heartbeat: [5683]: info: Link
vmx2.ucar.edu:eth4 up.

And VMs stop being reachable, etc. The only way to stabilize things is
to not start heartbeat on one of the nodes (vmx1 arbitrarily chosen) and
run all resources on a single node (vmx2 in this case).
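
The "No such device" error from write_child suggests, though the thread does not confirm it, that an interface named on a bcast line was missing or had been renamed at the time. A minimal sketch of a sanity check follows, assuming ha.cf uses the "bcast <device> [...]" syntax shown earlier. The function names are hypothetical; socket.if_nameindex() is the standard-library call that lists the interfaces known to the kernel.

```python
import socket


def bcast_devices(ha_cf_text):
    """Return the interface names listed on 'bcast' lines of an ha.cf file."""
    devices = []
    for line in ha_cf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if words and words[0] == "bcast":
            devices.extend(words[1:])
    return devices


def missing_devices(devices, present):
    """Return the configured devices that are absent from the host."""
    present = set(present)
    return [dev for dev in devices if dev not in present]


def host_interfaces():
    """Names of the network interfaces the kernel currently knows about."""
    return [name for _index, name in socket.if_nameindex()]


ha_cf = """udpport 694
bcast eth2
bcast eth1
"""
configured = bcast_devices(ha_cf)
print(configured)                                   # ['eth2', 'eth1']
print(missing_devices(configured, ["lo", "eth1"]))  # ['eth2']
```

On a real node, missing_devices(bcast_devices(open("/etc/ha.d/ha.cf").read()), host_interfaces()) would report any configured device the kernel does not have; the /etc/ha.d path is the usual location but is an assumption here.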

--Greg




Re: [Linux-HA] Problem with crm shadow CIB's

2013-05-28 Thread Tony Stocker
On Mon, 27 May 2013, Dejan Muhamedagic wrote:

> Hi,
>
> On Wed, May 22, 2013 at 01:20:06PM +, Tony Stocker wrote:
>>
>> Version Info:
>> OS:                CentOS 6.4
>> Kernel (current):  2.6.32-358.6.2.el6.x86_64
>> Pacemaker:         1.1.8-7.el6
>> Corosync:          1.4.1-15.el6_4
>> CRMSH:             1.2.5-55.4
>>
>> I was attempting to create a shadow CIB so that I could test changes and
>> I am unsuccessful getting it to work:
>>
>> # crm cib
>> crm(live)cib# new test-conf
>> INFO: 8: test-conf shadow CIB created
>> ERROR: 8: test-conf: no such shadow CIB
>
> If crmsh is coming from
> http://download.opensuse.org/repositories/network:/ha-clustering/CentOS_CentOS-6/
> then it was built against pacemaker v1.1.7 (that should change
> any day now). In the meantime, you can create a link as suggested
> here:

Yes, that is where I'm getting crmsh from.

> https://savannah.nongnu.org/bugs/?39013#comment0

Okay, I'll give that a try.  Thanks.

> Another option is to build crmsh on a host with pacemaker >= 1.1.8
> installed.

I'm trying to avoid building libraries/utilities if I can.

> Thanks,
>
> Dejan
>
>> crm(live)cib# reset test-conf
>> INFO: 9: copied live CIB to test-conf
>> crm(live)cib# list
>> crm(live)cib# new test-conf
>> A shadow instance 'test-conf' already exists.
>> To prevent accidental destruction of the cluster, the --force flag is
>> required in order to proceed.
>> crm(live)cib# use test-conf
>> ERROR: 12: test-conf: no such shadow CIB
>> crm(live)cib# delete test-conf
>> INFO: 13: test-conf shadow CIB deleted
>>
>> So it appears to be there at some level since I can copy the live CIB, and
>> it complains if I try to create the same name again, and I can delete it.
>> But I cannot see it via 'list' and I cannot 'use' it.
>>
>> Is this a bug or am I doing something wrong?  I'm following the examples
>> in the documentation and from here:
>> http://clusterlabs.org/wiki/Example_configurations
>>
>> Tony
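
The symptoms above (a shadow CIB that can be created and deleted but not listed or used) would fit crmsh and pacemaker looking in different directories for shadow files. That is a guess; the thread only says that crmsh was built against pacemaker 1.1.7. A small sketch for checking where shadow files actually land follows. Both directory paths and the "shadow.<name>" file naming are assumptions to verify against the installed packages, and find_shadow_cibs is a hypothetical helper.

```python
from pathlib import Path

# ASSUMED locations of the CIB directory for newer and older builds;
# confirm these on your own system before relying on them.
CANDIDATE_DIRS = ["/var/lib/pacemaker/cib", "/var/lib/heartbeat/crm"]

SHADOW_PREFIX = "shadow."


def find_shadow_cibs(directories):
    """Map each existing directory to the shadow CIB names found in it."""
    found = {}
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            found[str(path)] = sorted(
                entry.name[len(SHADOW_PREFIX):]
                for entry in path.glob(SHADOW_PREFIX + "*")
            )
    return found


# Prints an empty dict on a host where neither directory exists.
print(find_shadow_cibs(CANDIDATE_DIRS))
```

If a shadow file shows up in one directory while the tool reporting "no such shadow CIB" reads the other, that would explain why the link suggested in the bug report helps.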




Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Andrew Beekhof

On 29/05/2013, at 2:37 AM, Greg Woods wo...@ucar.edu wrote:

> I have two clusters that are both running CentOS 5.6 and
> heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running
> slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the
> first one and pacemaker-1.0.12-1.el5 on the other). They both have
> identical ha.cf files except that the bcast device names are different
> (and they are correct for each case, I checked), like this:
>
> udpport 694
> bcast eth2
> bcast eth1
> use_logd off
> logfile /var/log/halog
> debugfile /var/log/hadebug
> debug 1
> keepalive 2
> deadtime 15
> initdead 60
> node vmd1.ucar.edu
> node vmd2.ucar.edu
> auto_failback off
> respawn hacluster /usr/lib64/heartbeat/ipfail
> crm respawn

I don't know about the rest, but definitely do not use both ipfail and crm.
Pick one :)
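
A quick way to audit several ha.cf files for this combination is sketched below. It assumes the directive syntax shown in the quoted ha.cf; the set of values treated as "crm enabled" is an assumption, not taken from the heartbeat documentation, and the function name is hypothetical.

```python
# ASSUMPTION: these values of the "crm" directive all mean Pacemaker is enabled.
CRM_ENABLED_VALUES = {"yes", "on", "true", "respawn"}


def uses_ipfail_with_crm(ha_cf_text):
    """Return True if ha.cf both enables crm and respawns ipfail."""
    crm_enabled = False
    ipfail_respawned = False
    for line in ha_cf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if not words:
            continue
        if words[0] == "crm" and len(words) > 1:
            crm_enabled = words[1].lower() in CRM_ENABLED_VALUES
        elif words[0] == "respawn":
            # e.g. "respawn hacluster /usr/lib64/heartbeat/ipfail"
            if any(w == "ipfail" or w.endswith("/ipfail") for w in words[1:]):
                ipfail_respawned = True
    return crm_enabled and ipfail_respawned


sample = """auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn
"""
print(uses_ipfail_with_crm(sample))           # True
print(uses_ipfail_with_crm("crm respawn\n"))  # False
```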

 
 


Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:

>> respawn hacluster /usr/lib64/heartbeat/ipfail
>> crm respawn
>
> I don't know about the rest, but definitely do not use both ipfail and crm.
> Pick one :)

I guess I will have to look into what ipfail really does. I have a half
dozen clusters that have virtually the same ha.cf files and they have
been running for 2+ years with it specified this way.

--Greg




Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Andrew Beekhof

On 29/05/2013, at 8:05 AM, Greg Woods wo...@ucar.edu wrote:

> On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:
>
>>> respawn hacluster /usr/lib64/heartbeat/ipfail
>>> crm respawn
>>
>> I don't know about the rest, but definitely do not use both ipfail and crm.
>> Pick one :)
>
> I guess I will have to look into what ipfail really does.

With crm enabled, nothing.
Try 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

> I have a half
> dozen clusters that have virtually the same ha.cf files and they have
> been running for 2+ years with it specified this way.
>
> --Greg
 
 