Hi,

On 25-Nov-10, at 11:37 AM, Andrew Beekhof wrote:
> Given what you've described, you could probably remove the while loop
> during stop.
> It should be safe because Amazon is ensuring that it will only "run"
> in exactly one location.

I'll give that a try -- thanks.

I noticed something else interesting during my testing today -- I'm curious whether it's related to my testing method or is a sign of a configuration error.

To test Pacemaker's response to a node failure, I usually use iptables to cut off all network traffic from one node to the rest of the cluster. (I'm doing this instead of the typical "unplug the network line" method because I don't have physical access to the machines.) For example, I would run this on node test2 of a 3-node test environment:

    iptables -A INPUT -s test1 -j DROP
    iptables -A INPUT -s test3 -j DROP
    iptables -A OUTPUT -d test1 -j DROP
    iptables -A OUTPUT -d test3 -j DROP

As expected, Pacemaker detects the node failure and starts all the resources that were running on that node elsewhere. However, when I remove the rules with "iptables -F", there is a brief period where Pacemaker (or Heartbeat, I suppose) becomes very confused as to which nodes are up and which are down. For example, crm_mon will suddenly indicate that test3 is offline, and then show that it is back online ten seconds later, even though test3 was always part of the partition that had quorum.

The problem here is that these spurious node failures cause Pacemaker to initiate unnecessary resource migrations. Is it normal for the cluster to become confused for a while when the network connection to a node is suddenly restored? Or is this happening because using iptables is not a fair test of how the system will respond during a network split?

Thanks,

Andrew
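For anyone who wants to repeat the isolation test described above, here is a minimal sketch of it as a pair of shell functions. The peer names test1 and test3 come from the message. The function names (run, isolate_from, rejoin) and the DRY_RUN variable are illustrative assumptions, not part of Pacemaker or Heartbeat. The sketch also removes the rules one by one with "iptables -D" rather than flushing with "iptables -F" as in the message, so that unrelated rules are left alone; whether that changes the cluster's behaviour on reconnect is not something this sketch can answer.

```shell
#!/bin/sh
# Sketch only: isolate this node from named peers with iptables, then undo it.
# DRY_RUN defaults to 1, so the commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it (needs root).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop all traffic to and from each peer given as an argument.
isolate_from() {
    for peer in "$@"; do
        run iptables -A INPUT -s "$peer" -j DROP
        run iptables -A OUTPUT -d "$peer" -j DROP
    done
}

# Delete exactly the rules that isolate_from added (assumption: this
# replaces the "iptables -F" flush used in the original test).
rejoin() {
    for peer in "$@"; do
        run iptables -D INPUT -s "$peer" -j DROP
        run iptables -D OUTPUT -d "$peer" -j DROP
    done
}
```

On test2, "isolate_from test1 test3" would print the four DROP rules, and "rejoin test1 test3" would print the matching deletions; setting DRY_RUN=0 as root would apply them for real.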
