On 24.09.2014 at 22:35, Matthias Ferdinand <m...@14v.de> wrote:
> OS: Ubuntu 14.04 64bit
> corosync: 2.3.3-1ubuntu1
> 2 nodes
> 2 rings (em1, bond0(p2p1,p1p1)), rrp_mode: active,
> all with crossover cables, no switches
> transport: udpu
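For reference, a corosync.conf totem section matching the quoted setup might look roughly like this. This is only a sketch under stated assumptions: the original poster did not include the actual config, and the bindnetaddr networks and ports below are placeholders.

```
totem {
    version: 2
    transport: udpu
    rrp_mode: active

    # ring 0: onboard NIC em1, direct crossover link
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.0   # placeholder network
        mcastport: 5405
    }

    # ring 1: bond0 (p2p1 + p1p1), second crossover link
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0   # placeholder network
        mcastport: 5407
    }
}
```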
So, this bug

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=746269
https://bugzilla.redhat.com/show_bug.cgi?id=821352

is solved in your version of corosync? It must be, because otherwise the crossover point-to-point connection would always fail.

> If the cluster is up for some time (here: ~1 week) and one node is
> rebooted, corosync on the surviving node (no-carrier on all
> corosync-related interfaces) does not resume sending packets when the
> links come up again after the peer has finished rebooting (3-4 minutes
> link down; tcpdump on both nodes, on both em1 and bond0, shows no
> packets from the surviving node). The rebooted node then cannot see any
> neighbor and consequently decides to stonith the peer before starting
> resources. But the resources still cannot run until the stonith'd node
> has completely rebooted, because the DRBD volumes became outdated at
> "shutdown -r now" time.
>
> Subsequent reboots do not show any problems. Repeat after ~1 week of
> uptime, and the problem shows up again.
>
> This happened on two different cluster installs with roughly the same
> hardware (Dell PowerEdge R520 resp. R420, onboard Broadcom BCM5720
> (em1), 2x2-port Intel I350 (p2p1, p1p1)).

This looks like a software or configuration problem. I run 2 x R510 and 2 x R520 here with Debian, DRBD, Xen, Corosync and Pacemaker without seeing this.

Hmm, do you have two dual-port extension cards in each node? There was a bug in the kernel modules with that setup; maybe this is a regression or something related. I had to remove the second card.

HTH
Helmut Wollmersdorfer

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems