Hi Dan,

I recently went through a similar exercise to the one you describe, moving our prototype code onto AWS.
First, some background: I spent a stint building a control plane for autoscaling VMs on OpenStack (and am generally long in the tooth), but this is my first attempt at a web app, and therefore at Django too. I also grew up on VAXes, so the notion of an always-up cluster is deeply rooted. Technical comments follow inline...

On Wed, 1 May 2019 at 21:35, <[email protected]> wrote:

> My organization is moving into the AWS cloud, and with some other projects
> using MongoDB, ElasticSearch and a web application framework that is not
> Django, we've had no problem.
>
> I'm our "Systems/Applications Architect", and some years ago I helped
> choose Django over some other solutions. I stand by that as correct for
> us, but our Cloud guys want to know how to do Blue/Green deployments, and
> the more I look at it the less happy I am.
>
> Here's the problem:
>
> - Django's ORM has long shielded developers from simple SQL problems
>   of the "SELECT * FROM fubar ..." and "INSERT INTO fubar VALUES (...)"
>   sort.
> - However, if an existing "Blue" deployment knows about a column, it
>   will try to retrieve it:
>   - fubar = Fubar.objects.get(name='Samwise')
> - If a new "Green" deployment is coming up, and we want to wait until
>   Selenium testing has passed, we have the problem of migrations.
>
> I really don't see any simple way around a new database cluster/instance
> when we bring up a new cluster, with something like this:
>
> - Mark the live database as "in maintenance mode". The application
>   now will not write to the database, but we can also make that user's
>   access read-only to preserve this.
> - Take a snapshot.
> - Restore the snapshot to build the new database instance/cluster.
> - Mark the new database as "live", e.g. clear "maintenance mode". If
>   the webapp user is read-only, they are restored to full read/write
>   permissions.
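The "Blue knows about a column" failure mode is easy to reproduce with nothing but the standard library. A toy sketch (sqlite3 standing in for Postgres; the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fubar (name TEXT, nickname TEXT)")
conn.execute("INSERT INTO fubar VALUES ('Samwise', 'Sam')")

# The running "Blue" code names every column it was built against --
# this is roughly what Fubar.objects.get(name='Samwise') expands to.
blue_query = "SELECT name, nickname FROM fubar WHERE name = ?"
print(conn.execute(blue_query, ("Samwise",)).fetchone())  # ('Samwise', 'Sam')

# A "Green" migration drops the column (spelled portably, the way older
# SQLite migrations had to) while Blue is still serving traffic...
conn.executescript("""
    ALTER TABLE fubar RENAME TO fubar_old;
    CREATE TABLE fubar (name TEXT);
    INSERT INTO fubar SELECT name FROM fubar_old;
    DROP TABLE fubar_old;
""")

# ...and Blue's unchanged query now fails outright.
try:
    conn.execute(blue_query, ("Samwise",))
except sqlite3.OperationalError as exc:
    print("Blue breaks:", exc)  # no such column: nickname
```

The same breakage happens in the other direction too: Blue's `INSERT`s name every column it knows, so a column Green added as `NOT NULL` without a default will also reject Blue's writes.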
> - Run migrations in production.
> - Bring up new auto-scaling group.

We are not yet doing auto-scaling, but otherwise your description applies very well to us. Right now, we have a pair of VMs: a "logic" VM hosting Django, and a "db" VM hosting Postgres (long term, we may move to Aurora for the database, but we are not there right now). The logic VM is based on an Ubuntu base image, plus a load of extra stuff:

- Django, our code and all the Python dependencies
- A whole host of non-Python dependencies, starting with RabbitMQ (needed for Celery), nginx, etc.
- A whole lot of configuration for the above (starting with production keys, passwords and the like)

The net result is that not only does it take 10-15 minutes for AWS to spin up a new db VM from a snapshot, it also takes several minutes to spin up, install, and configure the logic VM.

So, we have a piece of code that can do a "live-to-<scenario>" upgrade:

- Where scenario is "live-to-live" or "live-to-test".
- The logic is the same in both, except for a couple of small pieces only in the live-to-live case, where we:
  - Pause the live system (db and Celery queues) before snapshotting it for the new spin-up
  - Create an archive of the database
  - Switch the Elastic IP on a successful sanity-test pass
- We also have a small piece of run-time code in our project/settings.py that, on a live system, enables HTTPS and so on.

Before we do the "live-to-live" upgrade, we always do a "live-to-test" upgrade. This ensures we have run all migrations and pre-release sanities on virtually current data; only then do we perform a *separate* live-to-live. While this works, it creates a window during which the service must be down. There is also a finite window during which all those third-party dependencies on apt and pip/PyPI expose the "live-to-live" to potential failure.

So in the "long term", I would prefer to attempt something like the following:

- Use a cluster of N logic VMs.
- Use an LB at the front end.
- Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and where the old column cannot be removed until a later update cycle. All migrations populate the new column.
- Spin up an N+1th VM with the new logic and, once sanity testing has passed, switch the N+1th machine on in the LB and remove one of the original N.
- Loop.
- Delete the old column.

Of course, the $64k question is how to keep the old logic and the new logic in sync across the two columns. For that, I can only wave my arms at present and say that the old column cannot really be there in its bare form; instead there will be some kind of a view that makes it look like it is, possibly with some old-school stored procedure/trigger logic in support. Of course, I would love it if there were some magic tooling developed by the Django and database gurus before I have to tackle this. Then again, I don't believe in magic. And nor do I believe we'll have an army of devs to fake the magic.

I'd love to be shown a better way (e.g. a complete second cluster, with a rolling migration of data from old to new until the old is killed?), else I'll be on the hook for making the above work!

Thanks, Shaheed

> Of course, some things that Django does really help:
>
> - The database migration is first tested by the test process, which
>   runs migrations.
> - The unit tests have succeeded before we try to do this migration.
>
> Does anyone have experience/cloud operations design with doing Blue/Green
> deployments with Django?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/ae5310c6-b69f-43af-a838-5dce7bd6a712%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
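P.S. To make the view-plus-trigger arm-waving above slightly more concrete, here is a toy sketch, again with the stdlib sqlite3 module standing in for Postgres (there it would be a view with INSTEAD OF triggers; every name below, fubar_t, full_name, is invented). The real table carries only the new column; a view with the old table name and old column shape, plus write-through triggers, keeps the untouched Blue code working while Green talks to the new column directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Green's schema: only the *new* column lives in the real table.
    CREATE TABLE fubar_t (id INTEGER PRIMARY KEY, full_name TEXT);

    -- A view with the old table name and old column shape keeps Blue working.
    CREATE VIEW fubar (id, name) AS SELECT id, full_name FROM fubar_t;

    -- Write-through triggers: Blue's INSERTs/UPDATEs land in the new column.
    CREATE TRIGGER fubar_ins INSTEAD OF INSERT ON fubar
    BEGIN
        INSERT INTO fubar_t (full_name) VALUES (NEW.name);
    END;

    CREATE TRIGGER fubar_upd INSTEAD OF UPDATE OF name ON fubar
    BEGIN
        UPDATE fubar_t SET full_name = NEW.name WHERE id = NEW.id;
    END;
""")

# Untouched Blue code keeps using the old table/column names...
conn.execute("INSERT INTO fubar (name) VALUES ('Samwise')")
conn.execute("UPDATE fubar SET name = 'Sam Gamgee' WHERE id = 1")

# ...while Green reads the new column, and the two always agree.
print(conn.execute("SELECT name FROM fubar").fetchone())         # ('Sam Gamgee',)
print(conn.execute("SELECT full_name FROM fubar_t").fetchone())  # ('Sam Gamgee',)
```

Note this only covers the simple rename/add case where old and new values are interchangeable; if the new column is a transformation of the old one, the trigger bodies would have to encode the conversion in both directions, which is exactly where the arm-waving starts again.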

