On Thu, Jun 26, 2014 at 01:30:01PM +0200, Lars Ellenberg wrote:
> On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> > Hello!
> > 
> > I've been seeing heartbeat cluster problems in Linux-based Vyatta and more 
> > recent VyOS networking/router appliances.
> > These are currently based on Debian Squeeze, and thus are using:
> > 
> > Package: heartbeat
> > Version: 1:3.0.3-2
> 
> Please use 3.0.5:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
> 

Do you think v3.0.5 fixes the issue of the heartbeat process crashing?

This patch perhaps? http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/3e51db646a21
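
For anyone who wants to check whether their own logs show the same failure pattern as the excerpts quoted below, here is a rough Python sketch. The message strings are copied from those excerpts (including heartbeat's own "cannnot" spelling). The function name `summarize` is made up for illustration, and the sketch assumes each syslog entry sits on a single line, unlike the wrapped lines in this mail.

```python
import re

# Message texts taken from the quoted log excerpts; "cannnot" (three n's)
# is how the message appears in those logs.
QUEUE_RE = re.compile(
    r"Message hist queue is filling up \((\d+) messages in queue\)"
)
LOWSEQ_MSG = "lowseq cannnot be greater than ackseq"
SHUTDOWN_MSG = "Emergency Shutdown"


def summarize(lines):
    """Scan log lines and return a tuple:
    (highest hist queue depth seen, lowseq error seen, emergency shutdown seen).

    `summarize` is a hypothetical helper, not part of heartbeat.
    """
    max_depth = 0
    lowseq_seen = False
    shutdown_seen = False
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            max_depth = max(max_depth, int(match.group(1)))
        if LOWSEQ_MSG in line:
            lowseq_seen = True
        if SHUTDOWN_MSG in line:
            shutdown_seen = True
    return max_depth, lowseq_seen, shutdown_seen
```

To use it, pass any iterable of lines, for example an open file handle on the syslog file (the path depends on the system's syslog configuration). A queue depth of 500 followed by the lowseq error and an emergency shutdown would match the sequence described in this report.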


Thanks,

-- Pasi

> > VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
> > 
> > The problem is that when (unexpected) networking problems cause multicast
> > issues, which in turn disrupt the inter-cluster communications, the
> > heartbeat processes die on the cluster nodes. That is bad, right? I assume
> > heartbeat should never die, especially not because of temporary
> > networking issues.
> > 
> > I've also seen heartbeat die because of temporary network maintenance
> > breaks.
> > 
> > Basically, I first see messages like these:
> > 
> > Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
> > Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
> > Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
> > Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
> > Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
> > Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
> > Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
> > 
> > These seem normal in the case of a networking problem. But then later:
> > 
> > Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
> > Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
> > Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
> > Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
> > Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
> > Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
> > Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> > Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
> > 
> > 
> > The "hist queue" size keeps increasing, and when it reaches 500 messages
> > bad things start happening:
> > 
> > 
> > Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
> > Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
> > Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
> > Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> > 
> > At this point clustering has failed, because the heartbeat
> > services/processes aren't running anymore.
> > 
> > Has anyone else seen this? 
> 
> It was fixed years ago ...
> 
> > It seems the bug gets triggered at 500 messages in the hist queue,
> > and then I always see "ERROR: lowseq cannnot be greater than ackseq",
> > and then heartbeat dies.
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
