Re: [Linux-HA] corosync communication stops after link down

2014-09-29 Thread Matthias Ferdinand
On Fri, Sep 26, 2014 at 12:00:04PM -0600, linux-ha-requ...@lists.linux-ha.org 
wrote:
 Message: 1
 Date: Fri, 26 Sep 2014 14:41:41 +0200
 From: Helmut Wollmersdorfer helmut.wollmersdor...@fixpunkt.de
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Subject: Re: [Linux-HA] corosync communication stops after link down
 Message-ID: 1b2fbdf7-c012-4296-8d51-859749207...@fixpunkt.de
 Content-Type: text/plain; charset=us-ascii
 
 
 Am 24.09.2014 um 22:35 schrieb Matthias Ferdinand m...@14v.de:
 
  OS: Ubuntu 14.04 64bit
  corosync: 2.3.3-1ubuntu1
  2 nodes
  2 rings (em1, bond0(p2p1,p1p1)) rrp_mode: active,
 all with crossover cables, no switches
  transport: udpu
 
 
 So, this bug 
 
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=746269
 
 https://bugzilla.redhat.com/show_bug.cgi?id=821352
 
 is solved in your version of corosync? It must be, because otherwise the 
 crossover point-to-point connection would always fail.

These bug reports are for corosync 1.x and point-to-point interfaces,
so they don't apply to our config (corosync 2.x with standard ip subnets
on the crossover connections).

  This happened on two different cluster installs with roughly the same
  hardware (Dell PowerEdge R520 and R420 respectively, onboard Broadcom
  BCM5720 (em1), 2x2port Intel I350 (p2p1,p1p1)).
 
 Looks like a software or configuration problem.
 
 We run 2 x R510 and 2 x R520 here with Debian, DRBD, XEN, Corosync, Pacemaker.
 
 Hmm, do you have two dual-port expansion cards in each node? There was a 
 bug in the kernel modules; maybe this is a regression or related. I had to 
 remove the second card.

Yes, it is 2x dual-port. I might try removing the second card, but we
really need all the ports. Do you have details about the kernel/module
bug?

With Ubuntu 12.04, we had to manually install the current Intel driver
modules, otherwise the kernel never saw a link up.

Regards
Matthias
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] corosync communication stops after link down

2014-09-26 Thread Helmut Wollmersdorfer

Am 24.09.2014 um 22:35 schrieb Matthias Ferdinand m...@14v.de:

 OS: Ubuntu 14.04 64bit
 corosync: 2.3.3-1ubuntu1
 2 nodes
 2 rings (em1, bond0(p2p1,p1p1)) rrp_mode: active,
all with crossover cables, no switches
 transport: udpu


So, this bug 

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=746269

https://bugzilla.redhat.com/show_bug.cgi?id=821352

is solved in your version of corosync? It must be, because otherwise the 
crossover point-to-point connection would always fail.


 If the cluster has been up for some time (here: ~1 week) and one node is
 rebooted, corosync on the surviving node (which sees no-carrier on all
 corosync-related interfaces during the reboot) does not resume sending
 packets when the links come up again after the peer has finished rebooting
 (3-4 minutes link down; tcpdump on both nodes, on both em1 and bond0,
 shows no packets from the surviving node). The rebooted node then cannot
 see any neighbor and consequently decides to stonith the peer before
 starting resources. But the resources still cannot run until the
 stonith'd node has completely rebooted, because the DRBD volumes became
 outdated at "shutdown -r now" time.
 
 Subsequent reboots do not show any problems. Repeat after ~ 1 week
 uptime, and the problem shows up again.
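
To line up the no-carrier window described above with the tcpdump
captures, it can help to log when the kernel itself sees each link go
down and come back. The following is only a minimal sketch and not part
of the original report: it assumes the standard Linux sysfs file
/sys/class/net/<iface>/carrier, and the helper names (read_carrier,
link_transitions, watch) are made up for illustration. It reports the
kernel's view of the link only and says nothing about whether corosync
resumes sending.

```python
from pathlib import Path
import time


def read_carrier(iface, sysfs_root="/sys/class/net"):
    """Return True (link up), False (no carrier) or None (state unavailable)."""
    try:
        value = (Path(sysfs_root) / iface / "carrier").read_text().strip()
    except OSError:
        # The interface does not exist, or it is administratively down
        # (the kernel typically refuses the read in that case).
        return None
    return value == "1"


def link_transitions(samples):
    """Yield (timestamp, "up" or "down") whenever the carrier state changes.

    `samples` is an iterable of (timestamp, state) pairs, with state as
    returned by read_carrier(); samples with state None are skipped.
    """
    previous = None
    for timestamp, state in samples:
        if state is None:
            continue
        if previous is not None and state != previous:
            yield (timestamp, "up" if state else "down")
        previous = state


def watch(iface, count, interval=1.0, sysfs_root="/sys/class/net"):
    """Poll one interface `count` times and return the observed transitions."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), read_carrier(iface, sysfs_root)))
        time.sleep(interval)
    return list(link_transitions(samples))
```

Run on the surviving node while the peer reboots, for example as
watch("em1", 600) and watch("bond0", 600), this returns the times at
which each link went down and came up again, which can then be compared
with the timestamps in the tcpdump output.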
 
 This happened on two different cluster installs with roughly the same
 hardware (Dell PowerEdge R520 and R420 respectively, onboard Broadcom
 BCM5720 (em1), 2x2port Intel I350 (p2p1,p1p1)).

Looks like a software or configuration problem.

We run 2 x R510 and 2 x R520 here with Debian, DRBD, XEN, Corosync, Pacemaker.

Hmm, do you have two dual-port expansion cards in each node? There was a bug 
in the kernel modules; maybe this is a regression or related. I had to remove 
the second card.

HTH

Helmut Wollmersdorfer





