On 18/06/2013, at 7:19 AM, Jon Eisenstein <j...@animoto.com> wrote:

> tl;dr summary: On EC2, we can't reuse IP addresses, and we need a reliable,
> scriptable procedure for replacing a dead (guaranteed no longer running)
> server with another one without needing to take the remaining cluster
> members down.
This is almost certainly the wrong approach. Have you tried their Virtual
Private Cloud (VPC) feature? This allows the use of predictable IPs.

> I'm trying to build a Pacemaker solution using Percona Replication Manager
> (https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst)
> within our EC2 environment. Essentially, the architecture would be 3
> independent MySQL servers, running in different data centers, each of which
> runs Pacemaker/Corosync with an agent that manages master/slave replication.
>
> I have a script that builds a new instance from the base OS, which installs
> the cluster software, generates the appropriate config files, and loads the
> CRM configuration on boot. This is the method we use to launch servers; in
> the event that a server dies, we don't attempt to recover it. Instead, we
> launch an entirely new instance (possibly even in a different data center),
> which corresponds to building a brand new server and assigning it a new
> private IP address. (Every server has a private IP address that directs
> traffic within the data center, and a public address that leaves the cloud
> only to come back in, introducing security implications, latency and
> additional cost.) Ideally, the boot script should be able to handle
> everything on its own -- we should be able to create the instance, and by
> the time it's finished running, the new box should be in the cluster as a
> slave, taking the place of whichever one had previously died.
>
> The problem I'm running into is that because we're on EC2, we don't control
> our IP address allocation. If we did, we'd start a new server with the same
> IP as the one it's replacing, and my understanding is that Pacemaker would
> pick right back up and let it join the cluster. Instead, because it has a
> new IP, we always end up in a split-brain situation, where the two original
> members of the cluster see each other but think the third is down, and the
> new one thinks it's the first member of a new cluster whose other two
> members are down. The only way I've found to correct this is to stop
> pacemaker/corosync on all instances, regenerate the config files, and start
> them up again. This is not really an ideal scenario.
>
> Does anyone have any experience or suggestions with working in this kind of
> situation? Moving off of EC2 is not an option, and creating a private
> network (Amazon VPC) so that we can get static addresses has performance
> implications we'd rather avoid. Any ideas for solutions or reliable
> workarounds, especially scriptable ones, would be extremely helpful. (That
> is, we won't have any process that automatically replaces a server after
> one goes down, but we would like the chef boot script, which is kicked off
> manually, to be able to go from software installation to rejoining the
> cluster automatically.)
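If you do stay on plain EC2 and drive membership from DNS, the boot-time
regeneration you describe is straightforward to script. Below is a rough,
untested sketch of what I'd expect the chef run to drop in place; it assumes
corosync 1.4 with the udpu transport, that the records really are
mysql-01/02/03, and all names and paths are illustrative only:

    #!/bin/sh
    # Untested sketch: rebuild /etc/corosync/corosync.conf at boot from
    # the well-known DNS names, so a replacement instance (with a
    # brand-new private IP) only ever has to take over a DNS record.
    # Assumes corosync 1.4 + udpu; names and paths are illustrative.
    set -e

    members=""
    for host in mysql-01 mysql-02 mysql-03; do
        # getent returns whatever the record currently points at
        ip=$(getent hosts "$host" | awk '{ print $1; exit }')
        [ -n "$ip" ] || { echo "cannot resolve $host" >&2; exit 1; }
        members="$members
        member {
            memberaddr: $ip
        }"
    done

    # Simplification: bind to this node's own private address; the
    # network address is the more usual value for bindnetaddr.
    my_ip=$(getent hosts "$(hostname -f)" | awk '{ print $1; exit }')

    cat > /etc/corosync/corosync.conf <<EOF
    totem {
        version: 2
        secauth: off
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: $my_ip
            mcastport: 5405$members
        }
    }

    service {
        # run pacemaker as a separate init service (plugin ver: 1)
        name: pacemaker
        ver: 1
    }
    EOF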
> Some options we have available, along with some things we've tried:
>
> - We can create DNS entries for the three servers by known names (i.e.,
>   mysql-01, mysql-02, mysql-03) which point to the private network IP
>   addresses. We can put those hostnames into the config files, or we can
>   resolve them at boot time and put the IP addresses in directly. However,
>   this requires that all three servers be online before running the
>   installation scripts on any one box. The ideal solution would use only
>   hostnames and re-resolve the IP any time the cluster needs to configure
>   membership, thus letting any new server take over the DNS entry but not
>   the IP address.
>
> - We can create an Elastic IP, which provides a static public IP even
>   before any of the servers are running. This way, the config can always
>   reference that IP and always be accessible, but it requires that traffic
>   going to that IP leave the cloud, which we'd like to avoid. Given that
>   pacemaker/corosync traffic is relatively light, having only those
>   services run over the public IP would be acceptable; so far, however,
>   that has not seemed to solve our split-brain problem.
>
> - We can always ensure that there is only one server corresponding to one
>   of the DNS entries at any given time. (That is, no running server thinks
>   that it's mysql-02 if we launch another one with the same name.)
>
> - We can regenerate the corosync.conf at any time without requiring the
>   services to be stopped, if it's possible to have that config take effect
>   without a service restart.
>
> - We can always determine the current IPs of all members from external
>   scripts via DNS.
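On your last two points: as far as I know, corosync 1.x cannot re-read
corosync.conf at runtime, so changing a udpu member list does mean bouncing
the daemon; corosync 2.x added corosync-cfgtool -R to ask every node to
reload the file. You can still avoid taking the whole cluster down by
restarting the survivors one at a time (so the remaining pair keeps quorum)
and telling pacemaker to forget the dead node's stale entry. Roughly, and
again untested (the host names, the regen script from the sketch above, and
the exact crm_node syntax for your versions are all assumptions):

    #!/bin/sh
    # Untested runbook sketch for "replace dead member mysql-02".
    OLD=mysql-02                 # name of the dead member
    SURVIVORS="mysql-01 mysql-03"

    # 1. Re-point the $OLD DNS record at the replacement instance's
    #    private IP (Route 53, or however the zone is managed).

    # 2. Refresh the udpu member list on each survivor, one at a time
    #    so quorum is never lost. Try a live reload first (corosync
    #    2.x); fall back to restarting the daemon (corosync 1.x).
    for h in $SURVIVORS; do
        ssh "$h" '/usr/local/sbin/regen-corosync-conf &&
                  { corosync-cfgtool -R || service corosync restart; }'
    done

    # 3. Drop the dead node's stale entry so the cluster doesn't carry
    #    a permanently-offline extra member (crmsh's "crm node delete"
    #    is an alternative spelling, depending on tool versions).
    ssh mysql-01 "crm_node --force --remove $OLD"

    # 4. Launch the replacement; its boot script (previous sketch)
    #    writes its own config, starts corosync/pacemaker, and should
    #    come up as a new slave.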