On Saturday, 30 June 2018 at 18:44 +0530, Nigel Babu wrote:
> Hello,
>
> I think the various pieces around infra have stabilized enough for us to
> think about this. I suggest that we think about having a Gerrit replica in
> the cloud (whichever clouds the CI consumes). This gives us a fallback
> option in case the cage has problems. It also gives us a good way to
> reduce the CI related load on the main Gerrit server. In the near future,
> when we run distributed testing, we're going to clone 10x as much as we do
> now. Right now we clone over git to take the load away from Gerrit, but
> when we have a replica, I vote we clone over HTTP(s).
>
> I would also recommend an offsite PostgreSQL replica that will let us be
> somewhat fault tolerant. In the event that the cage has a multi-hour
> unexplained outage, we'd be able to bring back essential services.
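Purely as illustration of what the PostgreSQL side would involve, a replica would mean streaming replication. A rough sketch for a 9.x primary and standby, as shipped by the distro at the time (the hostnames, addresses and the replication user below are made up):

```ini
# postgresql.conf on the primary (values are illustrative)
wal_level = hot_standby        # ship enough WAL for a standby ("replica" on 9.6+)
max_wal_senders = 3            # allow a few replication connections
wal_keep_segments = 64         # keep WAL around so the standby can catch up

# pg_hba.conf on the primary: let the standby connect for replication
# (replication user and standby address are hypothetical)
host  replication  replicator  203.0.113.10/32  md5

# recovery.conf on the standby (pre-PostgreSQL 12)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.org user=replicator password=...'
```

Confirming each transaction on both servers (the latency concern discussed below) would additionally mean setting synchronous_standby_names on the primary, which is exactly where round-trip time starts to matter.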
So I have been looking at getting a PostgreSQL replica in the cage for HA, and my recollection is that this wasn't that easy, depending on the tradeoffs we were willing to make. It gets better with recent PostgreSQL versions, but given how critical the DB is for Gerrit, I am not willing to run a version that isn't supported by the distro.

My goal, however, was to be able to upgrade/reboot PostgreSQL without interruption of service, a more modest goal than full off-site HA: it copes with the hypervisor being rebooted and with PostgreSQL being rebooted, so a manual switchover was an acceptable tradeoff.

> This is a suggestion. We'll need to estimate the cost of the work
> involved + the cost of operating both these hot standbys.

In terms of cost, that's easy: take the existing VM sizes and see how much they cost on Rackspace. For PostgreSQL, that's a 2G, 1-core VM with an 8G disk. Gerrit is a bit bigger, 4G or 6G and 2 cores IIRC, but nothing extraordinary.

IMHO, the real cost would be in terms of UX. For example, if we want each transaction confirmed on both servers, we would need to deal with latency over the internet, since that means acking the write on the first DB and then on the second, which may in turn hurt the responsiveness of the Gerrit interface if we do this over the internet.

Also, while I am not a distributed systems expert, the nature of Gerrit's storage (partially in a git repo, partially in the PostgreSQL DB) makes each modification non-atomic, and therefore dangerous if a split occurs at the wrong moment. I guess we would need to look at the Gerrit code to see how it handles that.

And given that we have no control over the network at Rackspace, any automated failover would depend on DNS, which means a change would realistically take 1h to 6h, and it likely can't be automated given the current setup of delegating DNS changes to IT inside the VPN. So 1h is the best case, with sysadmins ready right after the outage; depending on the date/hour, it might be more like 12 to 24h.
(assuming it happens during a workday), and that's not counting the issues we have had with DNS for the past 6 months.

So if we want something faster/more automated, we would need to expand our footprint to a DC where we control the network, and make sure we can update the route for Gerrit's IP (which we likely can't do with the cage, but I will defer to the NOC to confirm that).

That's why I started looking at a PostgreSQL replica in the cage: we control the network, we have keepalived working well (so no DNS propagation issues), and latency is rather low. And since my goal didn't include Gerrit, I deferred that for later as well.

So I think that, rather than dealing with the complexity of inter-DC replication where we do not control much, we should have the more modest and achievable goal of doing it where we do control things. That means getting Gerrit on a 2nd VM in the cage (which requires automated deployment), getting PG replicated, testing failover, and deciding what/how much we want to achieve.
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-infra mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-infra
