On 04/18/2013 06:18 PM, eXeC001er wrote:
>
>
> 2013/4/17 Fabio M. Di Nitto <[email protected]
> <mailto:[email protected]>>
>
>     On 4/17/2013 3:57 PM, eXeC001er wrote:
>     > Hello.
>     >
>     > I have tried to create the following demo cluster to check how the
>     > MasterWins logic works:
>     >
>     > NODE1 (VM)
>     >   |========== tap0 (host)
>     > NODE2 (VM)
>     >                 |============= br0 (host)
>     > NODE3 (VM)
>     >   |========== tap1 (host)
>     > NODE4 (VM)
>     >
>     >
>     > To simulate a 50/50 split I just remove "tap1" from "br0".
>     >
>     > Before the split I have the following on all nodes:
>     >
>     > ----------------------
>     > Quorate: Yes
>     > Nodeid Votes Qdevice  Name
>     >      1     1 A,V,MW   172.18.251.41
>     >      2     1 A,NV,MW  172.18.251.42 (local)
>     >      3     1 NA,NV,MW 172.18.251.43
>     >      4     1 A,NV,MW  172.18.251.44
>     >      0     3          QDEV
>     >
>     > ----------------------
>     >
>     > After the split,
>     >
>     > on NODE1 and NODE2 I see:
>     >
>     > ----------------------
>     > Quorate: Yes
>     > Nodeid Votes Qdevice  Name
>     >      1     1 A,V,MW   172.18.251.41 (local)
>     >      2     1 A,NV,MW  172.18.251.42
>     >      0     3          QDEV
>     > ----------------------
>     >
>     > on NODE3 and NODE4 I see:
>     >
>     > ----------------------
>     > Quorate: No
>     > Nodeid Votes Qdevice  Name
>     >      3     1 A,NV,MW  172.18.251.43
>     >      4     1 A,NV,MW  172.18.251.44 (local)
>     >      0     3          QDEV
>     > ----------------------
>     >
>     > So everything is fine and MasterWins works as designed.
>     >
>     > But after this check I tried to restore the network connection and
>     > added "tap1" back to "br0". I see that all nodes can ping each other,
>     > but corosync still shows me the 50/50 split.
>     >
>     > tcpdump:
>     > .....................
>     > 17:49:36.387217 IP 172.18.251.43.5404 > 172.18.251.44.5405: UDP, length 74
>     > 17:49:36.387441 IP 172.18.251.44.5404 > 172.18.251.43.5405: UDP, length 74
>     > 17:49:36.447590 IP 172.18.251.41.5404 > 172.18.251.42.5405: UDP, length 74
>     > 17:49:36.447811 IP 172.18.251.42.5404 > 172.18.251.41.5405: UDP, length 74
>     > 17:49:36.568557 IP 172.18.251.43.5404 > 172.18.251.44.5405: UDP, length 74
>     > 17:49:36.568804 IP 172.18.251.44.5404 > 172.18.251.43.5405: UDP, length 74
>     > 17:49:36.587829 IP 172.18.251.43.5404 > 239.255.1.1.5405: UDP, length 87
>     > 17:49:36.628254 IP 172.18.251.41.5404 > 172.18.251.42.5405: UDP, length 74
>     > 17:49:36.628442 IP 172.18.251.42.5404 > 172.18.251.41.5405: UDP, length 74
>     > 17:49:36.648323 IP 172.18.251.41.5404 > 239.255.1.1.5405: UDP, length 87
>     > ........................
>     >
>     >
>     > Any ideas ?
>     >
>
>     Besides the missing logs that might show something, I have tested this
>     scenario plenty of times, but using iptables instead.
>
>     I wonder if you have found a bug in the bridging code.
>
>     I suggest you try the following tests instead:
>
>     4 nodes, without qdisk, try to repeat your bridge remove/add test
>
>     4 nodes, without qdisk, use iptables instead (make sure to block mcast
>     traffic too)
>
>     then again with qdisk + iptables.
>
>
> I have tried with iptables. Everything works nicely.
>
> But in any case it is very strange, because after the network connection
> has been restored and I restart corosync on ALL nodes, my "cluster" works.
>
> The logs do not contain anything interesting. The latest records after the
> 50/50 split just say that some members have left. After restoring the
> connection there are no new records in the logfile.
>
> It is also very strange that to restore the whole cluster I need to restart
> corosync on ALL nodes. If I restart corosync on only 3 of the 4 nodes, then
> corosync on each node does not see any other nodes.

This sounds like a bug in the kernel's multicast bridge code: it does not
rebind the multicast groups in the bridge/switch once the port is added
back. I suspected that because I tested the same scenario with iptables
before, and corosync behaves as expected there.

I suggest you talk to the kernel networking guys. corosync won't attempt
to rejoin the group, since its socket is already bound and joined.

Fabio
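
To illustrate what "rejoining the group" would mean at the socket level, here is a minimal Python sketch. It is not corosync code and not a proposed patch. The group address and port are taken from the tcpdump output quoted above; the helper names (`build_mreq`, `open_listener`, `rejoin_group`) are made up for this example. The assumption is that dropping and re-adding the membership should make the kernel send a fresh IGMP membership report, which a snooping bridge could use to relearn the group.

```python
import socket

# Values taken from the tcpdump output quoted above.
MCAST_GROUP = "239.255.1.1"
MCAST_PORT = 5405


def build_mreq(group: str, local_ip: str = "0.0.0.0") -> bytes:
    """Pack a struct ip_mreq: group address followed by local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def open_listener(group: str = MCAST_GROUP, port: int = MCAST_PORT) -> socket.socket:
    """Open a UDP socket, bind it, and join the multicast group once."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, build_mreq(group))
    return sock


def rejoin_group(sock: socket.socket, group: str = MCAST_GROUP) -> None:
    """Leave and re-join the group on an already bound socket.

    Assumption: the new join triggers an unsolicited IGMP report.
    """
    mreq = build_mreq(group)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
    except OSError:
        pass  # socket was not a member; nothing to drop
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
```

A test program could call `open_listener()` on each VM, repeat the tap1 remove/add on the bridge, and then call `rejoin_group(sock)` to see whether multicast delivery resumes without a restart. That would help separate bridge behaviour from corosync behaviour.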
