On Thu, Jun 26, 2014 at 01:30:01PM +0200, Lars Ellenberg wrote:
> On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> > Hello!
> >
> > I've been seeing heartbeat cluster problems in Linux-based Vyatta and more
> > recent VyOS networking/router appliances.
> > These are currently based on Debian Squeeze, and thus are using:
> >
> > Package: heartbeat
> > Version: 1:3.0.3-2
>
> Please use 3.0.5:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
>

Do you think v3.0.5 fixes the heartbeat process crashing?

This patch perhaps?
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/3e51db646a21

Thanks,

-- Pasi

> > VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
> >
> > The problem is that when (unexpected) networking problems cause
> > multicast issues, which in turn disrupt inter-node cluster communication,
> > the heartbeat processes die on the cluster nodes,
> > which is bad, right? I assume heartbeat should never die, especially not
> > because of temporary networking issues..
> >
> > I've also seen heartbeat dying because of temporary network maintenance
> > breaks..
> >
> > Basically, first I'm seeing messages like these:
> >
> > Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
> > Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
> > Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
> > Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
> > Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
> > Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
> >
> > Which seems normal for a networking problem..
> > But then later:
> >
> > Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
> > Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
> > Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
> > Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
> > Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
> > Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
> > Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> > Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
> >
> >
> > The "hist queue" size keeps increasing, and when it reaches 500 messages
> > bad things start happening..
> >
> >
> > Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> > Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
> > Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> >
> > At this point clustering has failed, because the heartbeat
> > services/processes aren't running anymore..
> >
> > Has anyone else seen this?
>
> It was fixed years ago ...
>
> > It seems the bug gets triggered at 500 messages in the hist queue,
> > and then I always see the "ERROR: lowseq cannnot be greater than ackseq"
> > and then heartbeat dies..
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems