On 06/06/2016 06:43 PM, Nigel Babu wrote:
Hello folks,

Here's a postmortem for the Gerrit migration issues.

# Timeline of Events

May 25 - Test migration to PostgreSQL
May 27 - Migration to PostgreSQL on production
May 30 - Staging with Gerrit 2.12.2 available for testing
Jun 01 - Gerrit upgrade complete (0310)
Jun 01 - First notification of login issues (0628)
Jun 03 - Fix applied on test server and email sent out to affected users to test.
Jun 06 - Fix applied on production server

# Problems

Over the years, Gerrit has changed how it handles user accounts. In Gerrit 2.9, the Github plugin allowed users to sign up and then set their username. As we upgraded to 2.12, we discovered that your username defaults to your Github username. In our instance, this affects a small subset of people.

Additionally, very few people who were affected by this bug actually tested out the staging instance (only one person ran into the bug). Even in production, only those users who signed out of review.gluster.org after the upgrade were actually affected. Quite a few affected users did not realize they were affected because they didn't log out. By the time the issues were reported, we were a few hours into our upgrade and rollback wasn't an option anymore.

# Solution

The first preference was given to checking if we had an easy fix. I looked at a different plugin for Github authentication[1]. This plugin claims to allow users to map different external identities onto one Gerrit user. I timeboxed this testing to a few hours and found that it wasn't working during my limited testing.

Now that quick fixes were eliminated, I spent some time diagnosing the issue in detail. In the meantime, I reached out to the gerrit mailing list for help. From conversations with Michael and Raghavendra, I learned that we've run into problems like this before and Justin has fixed them. I reached out to Justin as well for help.

By the next morning (Jun 2), I had a good idea of what was wrong and a few ideas on how to fix it. Justin had gotten back to me as well, so I had more information to confirm my diagnosis.

People whose Github username was the same as their gerrit username had no issues. Some people had a completely different Github username from their gerrit username. And some people had multiple usernames against the same account_id (one of them matching their Github account). The older version of Gerrit + Github plugin seemed to tolerate both of these situations. The newer version was less forgiving about this inconsistency.

When I removed the entry in accounts_external_ids which corresponded to gerrit:<github-username>, on next login, a new account would be created for those who had issues. This was the safest method of all. However, this had the side effect that the new user would have none of the history of the old one.

I also tried renaming the username. The side effects of this were unknown at the time, but it seemed to work. This meant that users would retain their history, but their first git push/pull would fail until they changed the clone path. I checked with the Gerrit mailing list about side effects of renaming usernames. There are side effects, but they don't affect our particular use case, so we were free to do so.

On Friday (Jun 3), I wrote an SQL script to update everyone's accounts to a consistent state. If you had a different username in gerrit from Github, your gerrit username would be changed to match your Github account. If you had multiple usernames, only the one matching the Github username would be kept. I ran it and emailed everyone it affected to test logging in and doing reviews. Huge thanks to Niels, Prashanth, Jiffin, and others for testing the instance and reporting issues they came across.

On Monday (Jun 6), I backed up the database and ran this script in production.
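
As an illustration only (this is not the SQL script that was run), the two rules described above can be sketched in Python on plain (account_id, username) pairs. The function name, the sample data, and the account_id-to-Github-login mapping are all assumptions made for this example; a real fix would operate on Gerrit's database tables instead.

```python
from collections import defaultdict


def normalize_usernames(rows, github_login):
    """Collapse each account to the single username matching its Github login.

    rows:         iterable of (account_id, username) pairs (hypothetical input)
    github_login: dict mapping account_id -> Github login (hypothetical input)

    Returns (normalized_rows, changed_account_ids).
    """
    by_account = defaultdict(list)
    for account_id, username in rows:
        by_account[account_id].append(username)

    normalized = []
    changed = []
    for account_id in sorted(by_account):
        names = by_account[account_id]
        login = github_login.get(account_id)
        if login is None:
            # No known Github login: leave this account untouched.
            normalized.extend((account_id, name) for name in names)
            continue
        # Covers both cases: a single differing username is renamed, and
        # multiple usernames are reduced to the one matching Github.
        normalized.append((account_id, login))
        if names != [login]:
            changed.append(account_id)
    return normalized, changed


# Made-up sample data: account 1 is already consistent, account 2 has a
# different username, account 3 has two usernames.
rows = [(1, "alice"), (2, "bob_old"), (3, "carol"), (3, "carol-gh")]
github_login = {1: "alice", 2: "bob", 3: "carol-gh"}
normalized, changed = normalize_usernames(rows, github_login)
print(normalized)
print(changed)
```

The list of changed account ids is returned separately because those are the users who would need to be told to re-test and update their clone paths.
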
We had a few people have issues pushing/pulling, but everyone has now figured out the changes they need to make in the .git/config to get things working.

# What we Learned

* Gerrit's ssh-based flush-caches[2] command needs to be used after changing anything in the user table.
* After a Gerrit restart, it takes a bit for login to start working again. This time period depends on the machine's CPU/RAM. It is much shorter on the production machine.
* We have a reasonably good idea about Gerrit's accounts_external_ids table.

# What Went Well

* We had a fix deployed within 3 working days of the issue being reported.
* We've eliminated any repeat of this particular issue in the future.
* This instance is documented very well, including the different approaches and their outcomes.

# What Went Badly

* We did not have documentation of previous Gerrit upgrade issues.
* Not enough testing of the new version of Gerrit and not enough time.
* When issues were noticed, the rollback plans were non-viable. We'd like to be in a place where we catch these in staging, or at least soon enough in production that we can roll back.

# Notes for Future

* Document previous issues and post-mortems. I will be working on creating a place for this. This post-mortem and any future ones will be available in a public place.
Github issues or bugzilla, imho.
* Dogfood Gerrit. Most of the code other than our project code goes directly into Github. I would like new projects that I maintain to be run and reviewed on Gerrit, with replication to Github.
* Establish an official staging site for gerrit.
* Establish a week-long testing period before every upgrade with a small team of volunteers.
* Have a small team of developers around during upgrades, so we can do immediate tests of the upgrade.

[1] https://github.com/davido/gerrit-oauth-provider
[2] https://gerrit-review.googlesource.com/Documentation/cmd-flush-caches.html

--
nigelb
_______________________________________________
Gluster-infra mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-infra
