Hello, patchwork list!

About a week ago I performed the upgrade of patchwork.kernel.org from version 1.0 to version 2.1. This is my war story, since it didn't go nearly as smoothly as I was hoping.
First, general notes to describe the infra:

- patchwork.kernel.org consists of 74 projects and, prior to the migration, 1,755,019 patches (according to count(*) in the patchwork_patch table).
- of these, 750,081 patches are from one single project (LKML).
- the database is MySQL, in a master-slave failover configuration, accessed using haproxy.
The LKML project was mostly dead weight, since nobody was actually using it for tracking patches. We ended up creating a separate patchwork instance just for LKML, located here: https://lore.kernel.org/patchwork. The migration, therefore, included two overall steps:
1. delete the LKML project from patchwork.kernel.org, dropping nearly half of the db entries.
2. migrate the remainder to patchwork 2.1.

# Problems during the first stage

My attempt to delete the LKML project via the admin interface failed. Clicking the "Delete" button on the project settings page basically consumed all the RAM on the patchwork system and OOM-ed in the most horrible ways, requiring a system reboot. In an attempt to work around this, I manually deleted all patches from patchwork_patch that belonged to the LKML project, roughly as sketched below.
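For the curious, the cleanup looked something like this (a minimal sketch, not the exact statements I ran; the project id and linkname value are hypothetical, and you'd want to look up the real ones first):

    -- find the id of the LKML project (the id used below is made up):
    SELECT id, linkname FROM patchwork_project WHERE linkname = 'lkml';
    -- then delete its patches in bounded slices, so each transaction
    -- only locks a limited number of rows:
    DELETE FROM patchwork_patch
     WHERE project_id = 42            -- hypothetical LKML project id
       AND id BETWEEN 1 AND 100000;
    -- ...repeat for successive id ranges until no rows remain.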
This allowed me to access the "Delete" page and delete the project, though it also left me with a broken session, because my admin profile ended up corrupted. The uwsgi log showed this error:

Traceback (most recent call last):
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "./patchwork/views/patch.py", line 106, in list
    view_args = {'project_id': project.linkname})
  File "./patchwork/views/__init__.py", line 59, in generic_list
    if project.is_editable(user):
  File "./patchwork/models.py", line 69, in is_editable
    return self in user.profile.maintainer_projects.all()
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/utils/functional.py", line 226, in inner
    return func(self._wrapped, *args)
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 483, in __get__
    self.related.get_accessor_name()
RelatedObjectDoesNotExist: User has no profile.

I was able to create another admin user and continue.

# Problems during the second stage

At this stage, I started the migration process using manage.py migrate. Immediately, this ran into a problem with haproxy inactivity timeouts. As I mentioned, our database runs in a master-slave configuration behind haproxy, and tcp connections are configured to time out after 5 minutes of inactivity. The migration script was doing some serious database modifications on tables with a million or more rows, which took MUCH longer than our 5-minute timeout setting.
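For context, the relevant part of a haproxy setup like ours looks roughly like this (a sketch only; the section name, addresses and layout are illustrative, not our actual config):

    # hypothetical listener fronting the MySQL master/slave pair
    listen mysql
        mode tcp
        bind *:3306
        timeout client 5m    # idle client-side connections are cut after 5 minutes
        timeout server 5m    # ...and so are idle server-side connections
        server master 192.0.2.10:3306 check
        server slave  192.0.2.11:3306 check backup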
After setting things up to connect directly to the master, bypassing haproxy, I was able to proceed to the next step. I didn't get very far, though: at this point the migration routines were failing because they were trying to lock millions of rows and running out of database resources. I could not easily fix this, because raising the maximum number of locks would have required restarting the database server (an operation that affects multiple projects using it). Instead, I had to look into the django migration scripts and run the mysql queries manually, adding WHERE clauses so that they would operate on subsets of rows (limiting by id ranges -- see the sketch below). This took a few hours, in part because all operations had to be replicated to the slave server. Some of the tables involved were tens of gigabytes in size, so shipping all these replication logs to the slave server also took a lot of resources and resulted in lots of network and disk IO.
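To illustrate the chunking pattern (the column name and batch size here are made up, not the actual migration SQL):

    -- instead of one statement touching every row at once, e.g.:
    --   UPDATE patchwork_patch SET some_new_column = 0;
    -- run it in id-bounded slices so each transaction locks a bounded set of rows:
    UPDATE patchwork_patch
       SET some_new_column = 0        -- hypothetical column added by the migration
     WHERE id BETWEEN 1 AND 100000;
    UPDATE patchwork_patch
       SET some_new_column = 0
     WHERE id BETWEEN 100001 AND 200000;
    -- ...and so on, until the full id range is covered.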
In the end, it mostly worked out, despite the somewhat gruelling process. I did see one somewhat mysterious side effect of deleting the LKML project: some people lost maintainer status in other projects. I'm not sure how this came to be, but at least it's an easy fix -- and it's probably the same reason my admin profile got purged, requiring me to create a new one.
Needless to say, I hope future upgrades go a lot more smoothly. Did I test the migration before starting it? Yes, but on a very small subset of test data, which is why I didn't hit any of the above conditions -- they stemmed from a) deleting a project and b) copying, merging and deleting huge datasets as the backend db format got rejigged between versions 1.0 and 2.1.
My suggestion for anyone performing a similar migration from 1.0 to 2.1 -- if you have more than a few thousand patches -- is to perform the migration on a standalone database and then dump and re-import the result into production, instead of running it live on a multi-master or master-slave setup (see the sketch below). You will save a LOT of IO and avoid many problems related to table modifications and database resource starvation.
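Concretely, that workflow would look something like this (a rough sketch; the database name and host are made up, and you'd want the production instance stopped while the dump, offline migration and re-import run):

    # dump the production database
    mysqldump --single-transaction patchwork > patchwork-pre-migration.sql
    # load it into a standalone MySQL instance and migrate there
    mysql -h standalone-db patchwork < patchwork-pre-migration.sql
    ./manage.py migrate        # pointed at the standalone instance
    # dump the migrated result and re-import it into production
    mysqldump -h standalone-db patchwork > patchwork-migrated.sql
    mysql patchwork < patchwork-migrated.sql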
If you have any questions or if you'd like me to help troubleshoot something, I'll be quite happy to do that.
Best,
-K