I had a quick chat with Nick Moffitt and Liam Young of webops/GSA, then Tristram of the web team, which I think would be useful to all of you.
As we are standardising on a model of using Django 1.5 for the individual components (as defined in lp:ubuntu-ci-services-itself, docs/style.rst), it's worth thinking about the various ways any one of these components can fail.

A broader question is what happens when a component goes down entirely and cannot be talked to. What does the other end do in this circumstance to handle the failed request gracefully and prevent a domino effect? We cannot assume that the REST API we're talking to will reply, or reply within a given timeout (and we should always be setting timeouts). I won't cover this here, but you should definitely be thinking about how to handle it.

So, how can our little Django worker fail? Well, for a start, the node it is running on could fall over. That's okay: Django itself is horizontally scalable, so we create N WSGI servers (gunicorn) hosting the Django code and put them behind HAProxy with a health check set. With a bit of extra work (we cannot just juju upgrade-charm), this would also let us deploy code worker by worker, checking for a bad deployment along the way. The online services team is trying to get this deployment strategy in place; it's worth talking to bloodearnest if you head down that road.

But Django also talks to a Postgres database. How do we handle Postgres falling over and leaving Django with nothing to talk to? Pgbouncer helps here: if we put pgbouncer in front of a number of Postgres instances with a set master instance, we can tolerate some failover. Of course, pgbouncer then becomes a SPOF, though from talking to Nick it doesn't sound like this has bitten IS often. It's definitely worth talking to Stuart Bishop (stub), our in-house database expert, about how best to handle Postgres in this SOA architecture.

One caveat: replicating Postgres like this potentially falls over if we're using it to store locks, because you've got to wait on pgbouncer to synchronise locks across all the Postgres nodes.
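The gunicorn-behind-HAProxy layer described above might look something like the following config fragment. All names, addresses, and the /health/ path are invented for illustration; the `check` keyword is what pulls a dead worker out of rotation.

```
frontend component_http
    mode http
    bind *:80
    default_backend django_workers

backend django_workers
    mode http
    # Poll each worker; mark it down if the health check fails.
    option httpchk GET /health/
    server worker1 10.0.0.11:8000 check
    server worker2 10.0.0.12:8000 check
    server worker3 10.0.0.13:8000 check
```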
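To make the timeout point concrete, here is a minimal sketch of one component calling another's REST API. The endpoint name and URL are made up for illustration; the point is that every outbound call gets an explicit timeout and a graceful fallback, so a dead neighbour can't stall us indefinitely.

```python
import json
import urllib.error
import urllib.request

def fetch_ticket_status(ticket_id, base_url="http://ticket-system.invalid"):
    # base_url is a hypothetical placeholder for another component's API.
    url = "%s/api/tickets/%d/" % (base_url, ticket_id)
    try:
        # Always pass a timeout; the default is to block indefinitely,
        # which is exactly how one failed service takes down its callers.
        with urllib.request.urlopen(url, timeout=5.0) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Treat a dead, slow, or garbled service as "no answer" and let
        # the caller degrade gracefully instead of cascading the failure.
        return None
```

The caller then has to decide what "no answer" means for it (retry later, serve stale data, report the subsystem as down) rather than hanging.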
Also keep in mind whether we really need to store anything in a database at all. If you're talking to Launchpad for your information, you can probably leave the data there. If you're creating locks, it's probably worth rethinking whether you can flip that around: rather than going to find a place to put a task, can you put it on a big queue for some workers to grab from?

Expanding on that, just how much of Django do you really need? I can't imagine we'll need the administrative interface, the templating engine, the ORM, or really anything above the routing code in most cases, and it's probably worth disabling the rest. Django is pretty heavyweight; Tristram benchmarked it against Flask and others and came up with some interesting results: https://workflowy.com/shared/1574979c-4603-a345-a145-a6dbb7174885/ Unfortunately, the Preferred Technologies page pretty much forces us to use Django, but that doesn't mean we cannot strip it down to just what we need in each case.

Attached is the diagram Nick and Liam drew for how we might lay out each component. Keep in mind this is for a single microservice; we'd want this layout for each one. You can ignore the bit at the top for squid, as we won't need that in front of most things. Instead, a simple Apache in front of HAProxy will suffice. For good examples of how to do HAProxy in prodstack, both psearch (in lp:ubuntuone-servers-deploy) and certification (in lp:~canonical-losas/canonical-is-charms/certification) were recommended.

Thanks!

(Tristram, Liam, and Nick, if I got any of the above wrong, please do correct me.)
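The "big queue" idea above can be sketched in a few lines. In production the queue would be a proper broker (RabbitMQ or similar) rather than an in-process queue.Queue, and the task names here are invented, but the shape is the same: a producer pushes tasks, any free worker pops the next one, and nobody takes locks on shared database rows.

```python
import queue
import threading

tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Each worker just pulls the next task; no lock table, no polling
    # a database for work, no coordination beyond the queue itself.
    while True:
        task = tasks.get()
        if task is None:      # sentinel value: shut this worker down
            break
        with results_lock:
            results.append("built %s" % task)
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for package in ["apache2", "postgresql", "haproxy"]:
    tasks.put(package)

tasks.join()                  # block until every task has been processed
for _ in workers:
    tasks.put(None)           # one shutdown sentinel per worker
for w in workers:
    w.join()
```

A nice property of this shape is that scaling out is just starting more workers against the same queue; a worker that dies mid-task is the only failure case that needs real thought (brokers handle it with acknowledgements and redelivery).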
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

