I know it's tacky to reply to myself, but after another 15 minutes or so
of poring over the logs I can answer one of my own questions:

On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:

> 
> The questions are what do these messages actually mean, why is one
> cluster logging them and not the other, and is this something I should
> be worried about?

The answer to the last one is that this is definitely a problem. After
nearly half an hour, heartbeat logs the following:

May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=3a4]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [hg=4c97c17a]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13888]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [ld=0.50 0.33 0.28 3/316 13859]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] : [auth=1 feb94da356847a538290ea75f27423c996c0a595]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child: Exiting due to persistent errors: No such device
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBWRITE process 5689 exited with return code 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: HBWRITE process died.  Beginning communications restart process for comm channel 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBREAD process 5690 killed by signal 9 [SIGKILL - Kill, unblockable].
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: Both comm processes for channel 1 have died.  Restarting.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth4
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: Communications restart succeeded.
May 25 16:17:45 vmx1.ucar.edu heartbeat: [5683]: info: Link vmx2.ucar.edu:eth4 up.

After that, the VMs stop being reachable, among other problems. The only
way to stabilize things is to leave heartbeat stopped on one of the
nodes (vmx1, chosen arbitrarily) and run all resources on the other
node (vmx2 in this case).
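
For picking these channel-restart events out of a long log, here is a
minimal sketch. It assumes each entry sits on a single line in the log
file itself and follows the syslog layout seen in the excerpt. The
function name find_comm_restarts and the list of message fragments are
illustrative choices, not anything heartbeat provides.

```python
import re

# Layout assumed from the excerpt:
# "<Mon> <day> <HH:MM:SS> <host> heartbeat: [<pid>]: <LEVEL>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+heartbeat:\s+"
    r"\[(?P<pid>\d+)\]:\s+(?P<level>\w+):\s+(?P<msg>.*)$"
)

# Message fragments copied from the excerpt; extend or trim as needed.
KEY_FRAGMENTS = (
    "Exiting due to persistent errors",
    "HBWRITE process died",
    "Both comm processes",
    "Communications restart succeeded",
)


def find_comm_restarts(lines):
    """Yield (timestamp, pid, level, message) for channel-restart events."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(frag in m.group("msg") for frag in KEY_FRAGMENTS):
            yield (m.group("ts"), int(m.group("pid")),
                   m.group("level"), m.group("msg"))


if __name__ == "__main__":
    sample = [
        "May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]",
        "May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child: "
        "Exiting due to persistent errors: No such device",
        "May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: "
        "Communications restart succeeded.",
    ]
    for event in find_comm_restarts(sample):
        print(event)
```

Any iterable of lines works, so an open log file can be passed in place
of the sample list. Counting how often the write_child line shows up on
each node may help when comparing the two clusters.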

--Greg


