[Linux-HA] heartbeat 'ERROR' messages
I have two clusters that are both running CentOS 5.6 and heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the first one and pacemaker-1.0.12-1.el5 on the other). They both have identical ha.cf files except that the bcast device names are different (and they are correct for each case, I checked), like this:

udpport 694
bcast eth2
bcast eth1
use_logd off
logfile /var/log/halog
debugfile /var/log/hadebug
debug 1
keepalive 2
deadtime 15
initdead 60
node vmd1.ucar.edu
node vmd2.ucar.edu
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn

On one of them (which may or may not coincidentally be having some problems), I get these messages logged about every 2 seconds in /var/log/halog; on the other I don't see them:

May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping message with 10 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] : [t=NS_ackmsg]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] : [dest=vmx2.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] : [ackseq=3a0]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] : [(1)destuuid=0x5ceb280(37 28)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [auth=1 23b556bcb61a08abecf87cb6411c62e62cf99f0d]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping message with 12 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] : [t=status]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] : [st=active]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] : [dt=3a98]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] : [protocol=1]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=17b]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [ld=0.27 0.41 0.26 1/315 19183]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] : [auth=1 3d3da4df831636f7c274395041ffb49bbf215170]

The questions are: what do these messages actually mean, why is one cluster logging them and not the other, and is this something I should be worried about?

Thanks for any info,

--Greg

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
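As a reading aid for the dumped fields above, here is a minimal Python sketch that pulls the `name=value` pairs out of the log lines and decodes a few of them. It rests on an assumption that is not stated anywhere in the thread: that `ts`, `dt`, `seq` and `ackseq` are hexadecimal integers. The values are at least consistent with that reading, but treat the field interpretations as guesses. The helper names (`parse_fields`, `decode`) are made up for this example.

```python
import re
from datetime import datetime, timezone

# Matches one dumped field, e.g. "MSG[7] : [ts=51a13435]" or
# "MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]" (the "(1)" prefix is skipped).
FIELD_RE = re.compile(r"MSG\[(\d+)\] : \[(?:\(\d+\))?([^=\]]+)=([^\]]*)\]")


def parse_fields(log_text):
    """Return {field name: raw string value} for every dumped field found."""
    return {name: value for _index, name, value in FIELD_RE.findall(log_text)}


def decode(fields):
    """Decode selected fields, ASSUMING they are hex-encoded integers."""
    decoded = {}
    if "ts" in fields:
        # Assumption: ts is seconds since the Unix epoch.
        seconds = int(fields["ts"], 16)
        decoded["ts"] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    if "dt" in fields:
        # Assumption: dt is a duration in milliseconds.
        decoded["dt_ms"] = int(fields["dt"], 16)
    for key in ("seq", "ackseq"):
        if key in fields:
            decoded[key] = int(fields[key], 16)
    return decoded


# A few lines copied from the status message in the post.
sample = """\
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping message with 12 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] : [t=status]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] : [dt=3a98]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=17b]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13435]
"""

fields = parse_fields(sample)
decoded = decode(fields)
for name, value in decoded.items():
    print(name, value)
```

Read this way, `dt=3a98` is 15000, which would match `deadtime 15` if the unit is milliseconds, and `ts=51a13435` is 21:59:17 UTC on 2013-05-25, which lines up with the 15:59:17 log timestamp on a host six hours behind UTC. That suggests the dump is an ordinary status message being printed, not a corrupted one, though the thread itself does not confirm this.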
Re: [Linux-HA] heartbeat 'ERROR' messages
I know it's tacky to reply to myself, but I can answer one of my questions after another 15 minutes or so of poring through logs:

On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:
> The questions are what do these messages actually mean, why is one cluster logging them and not the other, and is this something I should be worried about?

The answer to the last one is that this is definitely a problem, because after nearly half an hour, this is logged:

May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=3a4]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [hg=4c97c17a]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13888]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [ld=0.50 0.33 0.28 3/316 13859]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] : [auth=1 feb94da356847a538290ea75f27423c996c0a595]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child: Exiting due to persistent errors: No such device
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBWRITE process 5689 exited with return code 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: HBWRITE process died. Beginning communications restart process for comm channel 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBREAD process 5690 killed by signal 9 [SIGKILL - Kill, unblockable].
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: Both comm processes for channel 1 have died. Restarting.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth4
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: Communications restart succeeded.
May 25 16:17:45 vmx1.ucar.edu heartbeat: [5683]: info: Link vmx2.ucar.edu:eth4 up.

And VMs stop being reachable, etc. The only way to stabilize things is to not start heartbeat on one of the nodes (vmx1 arbitrarily chosen) and run all resources on a single node (vmx2 in this case).

--Greg
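"No such device" is the standard error text for ENODEV, and the log names interface eth4 as the failing channel. One thing that could be ruled out quickly (a suggestion, not a diagnosis from the thread) is whether every interface named on a `bcast` line in ha.cf actually exists on the node. The sketch below does that check in Python; the function names are invented for this example, and reading `/sys/class/net` assumes a Linux host.

```python
import os


def bcast_interfaces(ha_cf_text):
    """Return the interface names listed on 'bcast' lines of an ha.cf file."""
    names = []
    for line in ha_cf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if words and words[0] == "bcast":
            names.extend(words[1:])  # a bcast line may list several devices
    return names


def missing_interfaces(ha_cf_text, present):
    """Return bcast interfaces from ha.cf that are not in the 'present' set."""
    return [name for name in bcast_interfaces(ha_cf_text) if name not in present]


def present_interfaces(sysfs_dir="/sys/class/net"):
    """List network interfaces known to the kernel (Linux sysfs assumed)."""
    if not os.path.isdir(sysfs_dir):
        return set()
    return set(os.listdir(sysfs_dir))


if __name__ == "__main__":
    # /etc/ha.d/ha.cf is the usual location; adjust if yours differs.
    path = "/etc/ha.d/ha.cf"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
        for name in missing_interfaces(text, present_interfaces()):
            print("bcast interface not present:", name)
```

A one-off run only shows the state at that moment. If the interface comes and goes (for example because of bridge or VM network reconfiguration), the check would need to be repeated around the time the errors are logged.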
Re: [Linux-HA] Problem with crm shadow CIB's
On Mon, 27 May 2013, Dejan Muhamedagic wrote:

> Hi,
>
> On Wed, May 22, 2013 at 01:20:06PM +, Tony Stocker wrote:
>> Version Info:
>> OS: CentOS 6.4
>> Kernel (current): 2.6.32-358.6.2.el6.x86_64
>> Pacemaker: 1.1.8-7.el6
>> Corosync: 1.4.1-15.el6_4
>> CRMSH: 1.2.5-55.4
>>
>> I was attempting to create a shadow CIB so that I could test changes and I am unsuccessful getting it to work:
>>
>> # crm cib
>> crm(live)cib# new test-conf
>> INFO: 8: test-conf shadow CIB created
>> ERROR: 8: test-conf: no such shadow CIB
>
> If crmsh is coming from
> http://download.opensuse.org/repositories/network:/ha-clustering/CentOS_CentOS-6/
> then it was built against pacemaker v1.1.7 (that should change any day now). In the meantime, you can create a link as suggested here:

Yes that is where I'm getting crmsh from.

> https://savannah.nongnu.org/bugs/?39013#comment0

Okay, I'll give that a try. Thanks.

> Another option is to build crmsh on a host with pacemaker = 1.1.8 installed.

I'm trying to avoid building libraries/utilities if I can.

> Thanks,
>
> Dejan
>
>> crm(live)cib# reset test-conf
>> INFO: 9: copied live CIB to test-conf
>> crm(live)cib# list
>> crm(live)cib# new test-conf
>> A shadow instance 'test-conf' already exists.
>> To prevent accidental destruction of the cluster, the --force flag is required in order to proceed.
>> crm(live)cib# use test-conf
>> ERROR: 12: test-conf: no such shadow CIB
>> crm(live)cib# delete test-conf
>> INFO: 13: test-conf shadow CIB deleted
>>
>> So it appears to be there at some level since I can copy the live CIB, and it complains if I try to create the same name again, and I can delete it. But I cannot see it via 'list' and I cannot 'use' it.
>>
>> Is this a bug or am I doing something wrong? I'm following the examples in the documentation and from here:
>> http://clusterlabs.org/wiki/Example_configurations

Tony
Re: [Linux-HA] heartbeat 'ERROR' messages
On 29/05/2013, at 2:37 AM, Greg Woods wo...@ucar.edu wrote:

> I have two clusters that are both running CentOS 5.6 and heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the first one and pacemaker-1.0.12-1.el5 on the other). They both have identical ha.cf files except that the bcast device names are different (and they are correct for each case, I checked), like this:
>
> udpport 694
> bcast eth2
> bcast eth1
> use_logd off
> logfile /var/log/halog
> debugfile /var/log/hadebug
> debug 1
> keepalive 2
> deadtime 15
> initdead 60
> node vmd1.ucar.edu
> node vmd2.ucar.edu
> auto_failback off
> respawn hacluster /usr/lib64/heartbeat/ipfail
> crm respawn

I don't know about the rest, but definitely do not use both ipfail and crm. Pick one :)
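The advice above (do not combine ipfail with crm) can be turned into a small check that is easy to run across several clusters' ha.cf files. The following Python sketch is illustrative only: the function names are made up, and the rule for deciding that `crm` is enabled (any value other than off/no/false/0) is an assumption, not something taken from the heartbeat documentation.

```python
def ha_cf_directives(text):
    """Yield each non-empty, non-comment ha.cf line as a list of words."""
    for line in text.splitlines():
        words = line.split("#", 1)[0].split()
        if words:
            yield words


def uses_ipfail_with_crm(text):
    """Return True if ha.cf both enables crm and respawns ipfail."""
    crm_enabled = False
    ipfail_respawned = False
    for words in ha_cf_directives(text):
        if words[0] == "crm" and len(words) > 1:
            # Assumption: anything except these values means "enabled".
            crm_enabled = words[1].lower() not in ("off", "no", "false", "0")
        elif words[0] == "respawn":
            # e.g. "respawn hacluster /usr/lib64/heartbeat/ipfail"
            if any(w == "ipfail" or w.endswith("/ipfail") for w in words[1:]):
                ipfail_respawned = True
    return crm_enabled and ipfail_respawned


posted = """\
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn
"""

if uses_ipfail_with_crm(posted):
    print("ha.cf enables crm and also respawns ipfail; pick one")
```

Run against the configuration posted in this thread, the check flags the file; with the `respawn ... ipfail` line removed it does not.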
Re: [Linux-HA] heartbeat 'ERROR' messages
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:

>> respawn hacluster /usr/lib64/heartbeat/ipfail
>> crm respawn
>
> I don't know about the rest, but definitely do not use both ipfail and crm. Pick one :)

I guess I will have to look into what ipfail really does. I have a half dozen clusters that have virtually the same ha.cf files and they have been running for 2+ years with it specified this way.

--Greg
Re: [Linux-HA] heartbeat 'ERROR' messages
On 29/05/2013, at 8:05 AM, Greg Woods wo...@ucar.edu wrote:

> On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:
>>> respawn hacluster /usr/lib64/heartbeat/ipfail
>>> crm respawn
>>
>> I don't know about the rest, but definitely do not use both ipfail and crm. Pick one :)
>
> I guess I will have to look into what ipfail really does.

With crm enabled, nothing.

Try http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

> I have a half dozen clusters that have virtually the same ha.cf files and they have been running for 2+ years with it specified this way.
>
> --Greg