Re: [Linux-HA] heartbeat 3.0.3 crashes if there are networking/multicast issues (ERROR: lowseq cannnot be greater than ackseq)

2014-06-30 Thread Pasi Kärkkäinen
On Thu, Jun 26, 2014 at 01:30:01PM +0200, Lars Ellenberg wrote:
 On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
  Hello!
  
  I've been seeing heartbeat cluster problems in Linux-based Vyatta and more 
  recent VyOS networking/router appliances.
  These are currently based on Debian Squeeze, and thus are using:
  
  Package: heartbeat
  Version: 1:3.0.3-2
 
 Please use 3.0.5:
 http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
 

Do you think v3.0.5 fixes the issue of the heartbeat process crashing?

This patch perhaps? http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/3e51db646a21


Thanks,

-- Pasi

  VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
  
  The problem is that when (unexpected) networking problems cause multicast
  issues, which in turn disrupt communication between the cluster nodes, the
  heartbeat processes die on the cluster nodes. That is bad, right? I assume
  heartbeat should never die, especially not because of temporary networking
  issues.
  
  I've also seen heartbeat die because of temporary network maintenance
  breaks.
  
  Basically, I first see messages like these:
  
  Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
  Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
  Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
  Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
  Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
  Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
  Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
  Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
  Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
  Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
  
  These seem normal in the case of a networking problem. But then later:
  
  Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
  Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
  Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
  Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
  Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
  Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
  Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
  Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
  
  
  The hist queue size keeps increasing, and when it reaches 500 messages,
  bad things start happening:
  
  
  Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
  Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
  Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
  Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
  Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
  Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
  Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
  Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
  
  At this point clustering has failed, because the heartbeat
  services/processes aren't running anymore.
  
  Has anyone else seen this? 
 
 It was fixed years ago ...
 
  It seems the bug is triggered at 500 messages in the hist queue: I then
  always see "ERROR: lowseq cannnot be greater than ackseq", and heartbeat
  dies.
 
 -- 
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 
 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat 3.0.3 crashes if there are networking/multicast issues (ERROR: lowseq cannnot be greater than ackseq)

2014-06-26 Thread Lars Ellenberg
On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
 Hello!
 
 I've been seeing heartbeat cluster problems in Linux-based Vyatta and more 
 recent VyOS networking/router appliances.
 These are currently based on Debian Squeeze, and thus are using:
 
 Package: heartbeat
 Version: 1:3.0.3-2

Please use 3.0.5:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
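
To check whether a node still runs a version older than the recommended 3.0.5, the Debian package version string quoted above can be compared against it. This is only a minimal sketch: the helper name `upstream_version` is hypothetical, and it assumes the upstream part of the version is purely numeric and dot-separated, as in `1:3.0.3-2`.

```python
def upstream_version(debian_version):
    """Strip the Debian epoch ("1:") and revision ("-2"); return a tuple of ints.

    Assumes a purely numeric, dot-separated upstream version.
    """
    without_epoch = debian_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


# Version string as reported in this thread.
installed = upstream_version("1:3.0.3-2")
recommended = (3, 0, 5)

if installed < recommended:
    print("heartbeat %s is older than 3.0.5" % ".".join(map(str, installed)))
```

Tuple comparison orders the components numerically, which avoids the pitfalls of comparing version strings as text.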

 VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244
 
 The problem is that when (unexpected) networking problems cause multicast
 issues, which in turn disrupt communication between the cluster nodes, the
 heartbeat processes die on the cluster nodes. That is bad, right? I assume
 heartbeat should never die, especially not because of temporary networking
 issues.
 
 I've also seen heartbeat die because of temporary network maintenance
 breaks.
 
 Basically, I first see messages like these:
 
 Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
 Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
 Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
 Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
 Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
 Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
 Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
 Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
 Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
 Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status
 
 These seem normal in the case of a networking problem. But then later:
 
 Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
 Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
 Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
 Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
 Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
 Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
 Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
 Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times
 
 
 The hist queue size keeps increasing, and when it reaches 500 messages,
 bad things start happening:
 
 
 Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
 Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
 Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
 Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
 Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
 Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
 Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
 Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
 
 At this point clustering has failed, because the heartbeat services/processes
 aren't running anymore.
 
 Has anyone else seen this? 

It was fixed years ago ...

 It seems the bug is triggered at 500 messages in the hist queue: I then
 always see "ERROR: lowseq cannnot be greater than ackseq", and heartbeat
 dies.
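
The failure pattern reported here (the hist queue climbing to 500 messages, followed by the lowseq/ackseq error) can be spotted in syslog with a small script. The following is only a minimal sketch: the function name `scan_heartbeat_log` is hypothetical, the patterns are copied from the log excerpts in this thread, and it assumes each log entry sits on a single line (the excerpts in the archive are wrapped by the mail client).

```python
import re

# Patterns copied from the log excerpts in this thread. The error text
# "cannnot" is spelled exactly as it appears in the heartbeat log.
QUEUE_RE = re.compile(r"Message hist queue is filling up \((\d+) messages in queue\)")
FATAL_TEXT = "lowseq cannnot be greater than ackseq"


def scan_heartbeat_log(lines):
    """Return (peak hist queue depth seen, whether the lowseq/ackseq error appeared)."""
    peak = 0
    fatal = False
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            peak = max(peak, int(match.group(1)))
        if FATAL_TEXT in line:
            fatal = True
    return peak, fatal


# Sample entries taken from the report above, unwrapped to one per line.
sample = [
    "Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)",
    "Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)",
    "Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq",
]

peak, fatal = scan_heartbeat_log(sample)
print("peak queue depth:", peak, "fatal error seen:", fatal)
```

To scan a real log, the lines of an open file object can be passed in place of `sample`; the path of the syslog file depends on the system's logging setup.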

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.