On Thu, Jun 26, 2014 at 01:30:01PM +0200, Lars Ellenberg wrote:
On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
Hello!
I've been seeing heartbeat cluster problems in Linux-based Vyatta and more
recent VyOS networking/router appliances.
These are currently based on Debian Squeeze, and thus are using:
Package: heartbeat
Version: 1:3.0.3-2
Please use 3.0.5:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
Do you think v3.0.5 fixes the issue of the heartbeat process crashing?
Perhaps this patch? http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/3e51db646a21
Thanks,
-- Pasi
VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
The problem is that unexpected networking problems causing multicast issues,
which disrupt the inter-cluster communication, make the heartbeat
processes die on the cluster nodes,
which is bad, right? I assume heartbeat should never die, especially not
because of temporary networking issues.
I've also seen heartbeat dying because of temporary network maintenance
breaks.
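Given the "Deadtime value may be too small" warning in the logs, the timing directives in ha.cf may be worth revisiting so short maintenance breaks don't trigger a false death declaration. The values below are only illustrative, not a recommendation for any particular environment:

# /etc/ha.d/ha.cf -- illustrative values only
keepalive 2        # send a heartbeat every 2 seconds
warntime 10        # warn about late heartbeats after 10 seconds
deadtime 60        # declare a peer dead only after 60 silent seconds
initdead 120       # extra allowance while nodes are still booting

Raising deadtime only masks short outages, though; it does not address the crash itself.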
Basically, first I'm seeing this kind of message:
Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
Those seem normal in the case of a networking problem. But then later:
Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
The hist queue size keeps increasing, and when it reaches 500 messages,
bad things start happening:
Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
At this point clustering has failed, because the heartbeat
processes aren't running anymore.
Has anyone else seen this?
It was fixed years ago ...
It seems the bug is triggered at 500 messages in the hist queue;
then I always see "ERROR: lowseq cannnot be greater than ackseq",
and then heartbeat dies.
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems