On 18/06/2013, at 7:19 AM, Jon Eisenstein <j...@animoto.com> wrote:

> tl;dr summary: On EC2, we can't reuse IP addresses, and we need a reliable, 
> scriptable procedure for replacing a dead (guaranteed no longer running) 
> server with another one without needing to take the remaining cluster members 
> down.

This is almost certainly the wrong approach.
Have you tried their Virtual Private Cloud (VPC) feature?  This allows for the
use of predictable IPs.
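With VPC you pick the private addresses yourself, so corosync can use a fixed
member list. A minimal sketch, assuming corosync 1.4's udpu transport (EC2 has
no multicast); the subnet and member addresses are placeholders for whatever
you carve out of your VPC:

    totem {
        version: 2
        secauth: off
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.1.0
            mcastport: 5405
            # one member block per cluster node, at a known private IP
            member {
                memberaddr: 10.0.1.11
            }
            member {
                memberaddr: 10.0.1.12
            }
            member {
                memberaddr: 10.0.1.13
            }
        }
    }

A replacement node can then come up on the same private IP as the one it
replaces, so the rest of the cluster never sees a membership change.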

> 
> 
> I'm trying to build a Pacemaker solution using Percona Replication Manager 
> (https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst)
>  within our EC2 environment. Essentially, the architecture would be 3 
> independent MySQL servers, running in different data centers, each of which 
> runs Pacemaker/Corosync with an agent that manages master/slave replication.
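> 
> For reference, the CRM configuration is essentially the master/slave pair
> from the PRM setup guide linked above; the parameter values below are
> illustrative, not our production settings:
> 
>     primitive p_mysql ocf:percona:mysql \
>         params config="/etc/my.cnf" socket="/var/run/mysqld/mysqld.sock" \
>             replication_user="repl" replication_passwd="xxxx" \
>             max_slave_lag="60" evict_outdated_slaves="false" \
>         op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
>         op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
>     ms ms_MySQL p_mysql \
>         meta master-max="1" master-node-max="1" clone-max="3" \
>             clone-node-max="1" notify="true" globally-unique="false"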
> 
> I have a script that builds a new instance from the base OS, which installs 
> the cluster software, generates the appropriate config files, and loads the 
> CRM configuration on boot. This is the method we use to launch servers; in 
> the event that a server dies, we don't attempt to recover it. Instead, we 
> launch an entirely new instance (possibly even in a different data center), 
> which amounts to building a brand-new server with a new private IP address. 
> (Every server has a private IP address that carries traffic within the data 
> center, and a public address whose traffic leaves the cloud only to come back 
> in, introducing security implications, latency, and additional cost.) Ideally, 
> the boot script should be able to handle everything on its own -- we should 
> be able to create the instance, and by the time the script finishes running, 
> the new box should be in the cluster as a slave, taking the place of 
> whichever one had previously died.
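> 
> In outline, the boot script does something like the following (simplified; 
> package names and paths are illustrative, and the chef run is what renders 
> corosync.conf and the CRM config from our templates):
> 
>     # install the stack
>     yum -y install mysql-server pacemaker corosync
>     # render corosync.conf and the CRM configuration
>     chef-client
>     # bring the node up and load the resource definitions
>     service corosync start
>     service pacemaker start
>     crm configure load update /etc/cluster/crm-config.txt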
> 
> The problem I'm running into is that because we're on EC2, we don't control 
> our IP address allocation. If we did, we'd start a new server with the same 
> IP as the one that it's replacing, and my understanding is that Pacemaker 
> would pick right back up and let it join the cluster. Instead, because it has 
> a new IP, we always end up in a split-brain situation, where the two original 
> members of the cluster see each other but think the third is down, and the 
> new one thinks it has formed a new cluster in which the other two members 
> are down. The only way I've found to correct this is to stop 
> pacemaker/corosync on all instances, regenerate the config files, and start 
> them up again. This is not really an ideal scenario.
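> 
> Concretely, that recovery procedure looks like this, run on every node 
> (init script names vary by distro; the chef step stands in for whatever 
> regenerates corosync.conf for you):
> 
>     service pacemaker stop
>     service corosync stop
>     # regenerate corosync.conf with the new member list
>     chef-client
>     service corosync start
>     service pacemaker start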
> 
> Does anyone have experience with, or suggestions for, this kind of 
> situation? Moving off of EC2 is not an option; creating a private network 
> (Amazon VPC) so that we can get static addresses has performance implications 
> we'd rather avoid. Any ideas for solutions or reliable workarounds, especially 
> scriptable ones, would be extremely helpful. (To be clear, we won't have any 
> process that automatically replaces a server after one goes down, but we would 
> like the chef boot script, which is kicked off manually, to go from software 
> installation to rejoining the cluster automatically.)
> 
> 
> Some options we have available, along with some things we've tried:
> 
> - We can create DNS entries for the three servers under known names (e.g. 
> mysql-01, mysql-02, mysql-03) which point to the private network IP 
> addresses. We can put those hostnames into the config files, or we can 
> resolve them at boot time and put in the IP addresses directly. However, this 
> requires that all three servers be online before running the installation 
> scripts on any one box. The ideal solution would use only hostnames and 
> re-resolve the IPs any time the cluster needs to configure membership, thus 
> letting any new server take over the DNS entry but not the IP address. (A 
> sketch of the boot-time resolution we use today is at the end of this 
> message.)
> 
> - We can create an Elastic IP, which provides a static public IP even before 
> any of the servers are running. This way, the config can always reference 
> that IP and it will always be reachable, but traffic to it must leave the 
> cloud, which we'd like to avoid. Given that pacemaker/corosync traffic is 
> relatively light, running only those services over the public IP would be 
> acceptable; so far, however, that has not seemed to solve our split-brain 
> problem.
> 
> - We can always ensure that there is only one server corresponding to one of 
> the DNS entries at any given time. (That is, no running server thinks that 
> it's mysql-02 if we launch another one with the same name.)
> 
> - We can regenerate the corosync.conf at any time without requiring the 
> services to be stopped, if it's possible to have that config take effect 
> without a service restart.
> 
> - We can always determine the current IPs of all members from external 
> scripts via DNS.
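> 
> As promised in the first option above, here is roughly what the boot-time 
> resolution looks like today; the hostnames and domain are illustrative, and 
> our chef template splices the generated member stanzas into corosync.conf:
> 
>     # resolve the well-known cluster names and emit udpu member stanzas
>     for n in mysql-01 mysql-02 mysql-03; do
>         ip=$(dig +short "${n}.example.internal")
>         printf '        member {\n            memberaddr: %s\n        }\n' "${ip}"
>     done > /tmp/corosync-members.txt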

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
