On Fri, Nov 26, 2010 at 9:36 AM, Andrew Miklas <[email protected]> wrote:
> Hi,
>
> On 25-Nov-10, at 11:37 AM, Andrew Beekhof wrote:
>
>> Given what you've described, you could probably remove the while loop
>> during stop.
>> It should be safe because Amazon is ensuring that it will only "run"
>> in exactly one location.
>
> I'll give that a try -- thanks.
>
>
> I noticed something else interesting during my testing today -- I'm
> curious if it's related to my testing method or is a sign of a
> configuration error.  To test Pacemaker's response to a node failure,
> I usually use iptables to cut off all network traffic from one node to
> the rest of the cluster.  (I'm doing this instead of the typical
> "unplug the network line" method because I don't have physical access
> to the machines).
>
> For example, I would run this on node test2 of a 3 node test
> environment:
> "iptables -A INPUT -s test1 -j DROP; iptables -A INPUT -s test3 -j
> DROP; iptables -A OUTPUT -d test1 -j DROP; iptables -A OUTPUT -d test3
> -j DROP"
>
> As expected, Pacemaker detects the node failure and starts up all the
> resources that were running on that node elsewhere.  However, when I
> remove the rules with "iptables -F", there is a brief period where
> Pacemaker (or Heartbeat, I suppose) becomes very confused as to which
> nodes are up and which are down.  For example, crm_mon will suddenly
> indicate that test3 is offline, and then show that it is back online
> ten seconds later, even though test3 was always part of the partition
> that had quorum.

Yeah, that's heartbeat, I'm afraid.

>
> The problem here is that these spurious node failures cause Pacemaker
> to initiate unnecessary resource migrations.  Is it normal for the
> cluster to become confused for a while when the network connection to
> a node is suddenly restored?

It's normal for the CCM (part of heartbeat) and used to be normal for
corosync.  These days I think corosync does a better job in these
scenarios.

> Or is this happening because using
> iptables is not a fair test of how the system will respond during a
> network split?

Unless you've got _really_ quick hands, you're also creating an
asymmetric network split, i.e. A can see B but B can't see A.
This would cause additional confusion at the messaging level.

>
>
> Thanks,
>
>
> Andrew
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
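
One way to narrow the asymmetric window described above might be to load
all four DROP rules in a single iptables-restore call rather than as
separate iptables commands, since iptables-restore commits a table's
rules together at COMMIT.  The following is only a sketch under that
assumption: build_isolation_rules is a hypothetical helper name, the
peer names test1 and test3 come from the example in the quoted message,
and the names are assumed to resolve on the node.  It has not been
verified against the cluster discussed here.

```shell
#!/bin/sh
# Sketch: emit one iptables-restore payload that isolates this node from
# the listed peers, so the inbound and outbound DROP rules are committed
# together instead of one command at a time.
# build_isolation_rules is a hypothetical helper, not part of iptables.
build_isolation_rules() {
    echo "*filter"
    for peer in "$@"; do
        echo "-A INPUT -s $peer -j DROP"
        echo "-A OUTPUT -d $peer -j DROP"
    done
    echo "COMMIT"
}

# Usage (as root, on test2); --noflush keeps any existing rules in place:
#   build_isolation_rules test1 test3 | iptables-restore --noflush
```

This only affects how the split is created.  It does not change how
heartbeat's CCM reacts once connectivity is restored.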
