On Thu, Apr 22, 2010 at 04:35:08PM -0500, David Teigland wrote:
> On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
> > Problem:
> >
> > Under certain circumstances cpg does not send group leave messages.
> >
> > With a big token timeout (tested with token == 5min).
> > 1 start all nodes
> > 2 start ./test/testcpg on all nodes
> > 2 go to the node with the lowest nodeid
> > 3 ifconfig <int> down && killall -9 corosync && /etc/init.d/corosync
> > restart && ./testcpg
> > 4 the other nodes will not get the cpg leave event
> > 5 testcpg reports an extra cpg group (basically one was not removed)
> >
> > Solution:
> > If a member gets removed using the new trans_list and
> > that member is the node used for syncing (lowest nodeid)
> > then the next lowest node needs to be chosen for syncing.
> >
> > David would you mind confirming that this solves your problem?
>
> It works great, thanks!

That was after two tests, and it may have been a bit hasty... when I went
back to do some further tests, I happened to make a slight mistake running
the usual steps, and the node failure then went unnoticed like before.
When repeating the "mistake" intentionally, I get the same problem.

This new test is:

1  nodes 1,2,3,4: cman_tool join
2  create iptables partition: 1 | 2,3,4
3  node 1: kill -9 corosync
4  remove iptables partition: 1,2,3,4
5  node 1: cman_tool join
6  nodes 1,2,3,4: fenced; fence_tool join
7  create iptables partition: 1 | 2,3,4
8  node 1: kill -9 corosync
9  remove iptables partition: 1,2,3,4
10 node 1: cman_tool join
11 no confchg removing 1 from the fenced cpg on nodes 2,3,4

Dave

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais