On 24.09.2014 at 22:35, Matthias Ferdinand <m...@14v.de> wrote:
> OS: Ubuntu 14.04 64bit
> corosync: 2.3.3-1ubuntu1
> 2 nodes
> 2 rings (em1, bond0(p2p1,p1p1)), rrp_mode: active,
> all with crossover cables, no switches
> transport: udpu
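For reference, a corosync.conf totem section matching the quoted setup might look roughly like this. This is only a sketch under stated assumptions: the original poster did not include the actual config, and the bindnetaddr networks and ports below are placeholders.

```
totem {
    version: 2
    transport: udpu
    rrp_mode: active

    # ring 0: onboard NIC em1, direct crossover link
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.0   # placeholder network
        mcastport: 5405
    }

    # ring 1: bond0 (p2p1 + p1p1), second crossover link
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0   # placeholder network
        mcastport: 5407
    }
}
```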
So, this bug

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=746269
https://bugzilla.redhat.com/show_bug.cgi?id=821352

is solved in your version of corosync? It must be, because otherwise the crossover point-to-point connection would always fail.

> If the cluster is up for some time (here: ~1 week) and one node is
> rebooted, corosync on the surviving node (no-carrier on all
> corosync-related interfaces) does not resume sending packets when the
> links come up again after the peer has finished rebooting (3-4 minutes
> link down; tcpdump on both nodes, on both em1 and bond0, shows no
> packets from the surviving node). The rebooted node then cannot see any
> neighbor and consequently decides to stonith the peer before starting
> resources. But the resources still cannot run until the stonith'd node
> has completely rebooted, because the DRBD volumes became outdated at
> "shutdown -r now" time.
>
> Subsequent reboots do not show any problems. Repeat after ~1 week of
> uptime, and the problem shows up again.
>
> This happened on two different cluster installs with roughly the same
> hardware (Dell PowerEdge R520 resp. R420, onboard Broadcom BCM5720
> (em1), 2x2-port Intel I350 (p2p1, p1p1)).

This looks like a software or configuration problem. I run 2 x R510 and 2 x R520 here with Debian, DRBD, Xen, Corosync and Pacemaker without seeing this.

Hmm, do you have two dual-port extension cards in each node? There was a bug in the kernel modules with that setup; maybe this is a regression or something related. I had to remove the second card.

HTH
Helmut Wollmersdorfer

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems