[migration] Making the forums and wiki cut-over

Terry Ellison Thu, 25 Aug 2011 08:59:15 -0700

Sorry in advance. I've tried to put this type of content on the cwiki and it has largely been ignored by the DL, hence this email. It is a long one, so I have tried to structure it and keep to simple plain text (as the DL forwarder seems to screw up text/html markup).

Since this is so long, any reply threads could kill us. So can I ask respondents to open up a new [migration] thread to discuss specific points in depth rather than replying directly to this.


*BACKGROUND*

As you all know, I've been doing work on the forums and wiki, though the wiki has been my main focus recently. I am acting in two roles here: (i) the lead SysAdmin for the ooo-wiki and ooo-forum VMs; (ii) the lead (and currently only) application maintainer on both systems. I also have ssh access to the current prod systems running in Oracle infrastructure and do equivalent roles there. So in practice, I am doing all of this related work, including liaising with the project, the infrastructure team and with Andrew Rist who represents Oracle here.

I understandably have to work within the practices of the Apache infrastructure team to retain my permit to act as SysAdmin on these two VMs. I am also trying to meet the expectations of the project, the infrastructure teams and the needs of our user population across all of this work.

We seem to have a Catch-22 here, and this email is about how we break this and move these aspects of the project forward. My interpretation of this Catch-22 is that whilst our current interactions on the DL are a good basis for individuals articulating views on a particular thread (and some seem to generate hundreds of viewpoints) we have no functioning mechanism to move to, and adopt some form of, a consensus policy or decision. The exact project requirements for this wiki, these forums and the [email protected] mail forwarders are cases in point. However, the infrastructure team believe that we, the project, have an urgency about making this cut over from Oracle to Apache infrastructure, and are pressing me to make progress.

I can't execute any plan without a baseline requirement and set of assumptions, so what this note attempts is to lay down such a set, and the decisions that need to be made to go forward. So PLEASE, I don't want any flames about my use of DECISION below. What I simply mean is that if the PPMC as a body accepts these, then I will try my best to move this work forward. Of course you are free to challenge / change any of this if that is a PPMC voted decision, but in this case I need to move into a different mode; to suspend work and stop the clock until we have a PPMC-endorsed baseline to replan on. I am NOT going to press on without broad endorsement and then be criticised in retrospect for doing so.


*INFRASTRUCTURE DRIVERS*

The infrastructure team has a policy of bringing in new services at current S/W versions whenever possible -- simply because it makes it easier to support them, and doing this before the service is on-line involves less work and risk than when it is in production. I understand and agree with this goal even though it can front-load work.

* The infrastructure stack is based on a standard Ubuntu server LAMP stack as at current LTS (Ubuntu 10.04-3 LTS), which includes PHP 5.3.2.

* The forums are stable, but at an N-1 release level (phpBB 3.0.8 vs. 3.0.9).

* *DECISION*: Upgrade the ooo-forums phpBB app + customisations to 3.0.9 before go-live. (Based on my last 5 upgrades, this is 1-2 days' work, the main part being the regression of a 1K line customisation patch when we rebaseline the package from 3.0.8 -> 3.0.9.)

* The prod wiki is v1.15.1, which is at an N-3 major release level (that's 30 months old: two major and 10 minor revisions behind the current supported release). This also runs on PHP 5.2.0.

* We need a reverse-proxy HTTP cache for performance reasons on the wiki. One of the four market leaders in this niche is another Apache project: Apache Traffic Server (ATS). It makes sense to stay "in-house" here for both support and referenceability reasons.

* *DECISION*: Adopt ATS v3.0.1 as the HTTP cache for the wiki. (BTW, this work has been done and the product is excellent.)

PHP 5.3 introduced extra checking to remove an area of tolerance that PHP 5.x (x<3) allowed. This was to do with when and how parameters can be passed by reference under certain circumstances. So moving a code base from 5.2 to 5.3 involved a lot of work identifying and eliminating these mis-codings. This was done by the MediaWiki team in MW v1.16.

I had planned to move to MW v1.15.5 (the last stable 1.15.x) as our baseline and I've done this work integrating it with Apache Traffic Server (ATS) and our LAMP stack. This is stable and performant enough to show that we are good. However, I have only identified and bug-fixed the main path 5.2->5.3 coding issues. During my testing I have subsequently discovered others and there are undoubtedly more to find. I've also discussed this with the MW devs on the MW IRC channel. Given this, the consensus in the @infra team (me included) is that we should bite the bullet now and move to current MW 1.17.0 even given the extra work. There are some performance risks associated with MW 1.17.0 which we need to mitigate. However, given that we've already got a complete LAMP+ATS+MW in an ESX hosted VM performing like a dingbat, we really only face the 1.15.5 -> 1.17.0 issues in this step.

* *DECISION*: Upgrade the ooo-wiki MediaWiki (MW) + all extensions to MW v1.17.0.

* *DECISION*: I have agreed with infrastructure that we will keep 1 core on "standby" so we can up the VM to a 2-core VM if we are seeing unacceptable performance problems with one core.


*BRANDING AND OTHER CONTENT/ACCESS CHANGES*

I've asked for feedback and "doer" support on the content aspect of the wiki and the forum. There have been hundreds of associated emails and unstructured discussion but no hard decisions taken. Drew and Dave F have offered to get involved here, but we need to set up accounts etc., and move into execution.

* *DECISION*: We will cut over the wiki and the forums with the content as-is and implement branding and access control changes within the a.o infrastructure when volunteers come on-stream to resource this. This is the standard "transfer then clean-up" approach adopted when a migration is time critical.


*PRIORITISATION*

On the one hand the forums involve a lot less work and technical risk. On the other they are arguably also used more than the wiki at the moment, and the post rates are a LOT higher. Because I am a single resource constraint, I can't do both at the same time. This one is toss-a-coin, but my instinct is to get the one that we can do quickly done. But if there is a strong consensus to swap then I can do it.

* *DECISION*: The priority is to work the forums over the wiki. We cut the forums over first.


*CUT-OVER*

There are two facets to cut-over: content move and DNS-based IP reassignment. Clearly we need to freeze update access to the services prior to start of content move and continue update-freeze on the legacy service. Bringing the content across involves a backup, copy and restore which can be rehearsed and scripted, but in the case of the wiki, this will be a few hours even if fully automated.
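
As a rough illustration only of the freeze, backup, copy, restore and DNS sequence described above, here is a minimal Python sketch. The step names and the placeholder actions are assumptions made for the example, not the actual migration scripts; the point is simply that the steps run in a fixed order and the run stops at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed, so a rehearsal
    can report exactly how far it got.
    """
    completed = []
    for step in steps:
        if not step.action():
            break
        completed.append(step.name)
    return completed


def placeholder() -> bool:
    # Hypothetical stand-in for a real scripted action.
    return True


CUTOVER_STEPS = [
    Step("freeze updates on legacy service", placeholder),
    Step("back up content", placeholder),
    Step("copy backup to new host", placeholder),
    Step("restore on new host", placeholder),
    Step("reassign DNS to new IP", placeholder),
]

if __name__ == "__main__":
    for name in run_cutover(CUTOVER_STEPS):
        print("done:", name)
```

Because each step is just a callable, the same list can be used for a rehearsal against a test host and for the real move.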


The main issues are:

* Migration coordination is more of a Programme Management / Coordination challenge than a purely technical one. In the old (paid/corporate) days, I would have a Programme Manager (PM)-type working alongside me covering this aspect.

* *REQUEST*: Would anyone who has previous experience of doing this like to volunteer to take this role, so I can focus on the technical stuff?

* We have to transfer DNS control for oo.o to Apache even if the A and MX records point to the @Oracle IP addresses.

* *ACTION*: Our PM needs to identify who the authoritative controller for the DNS entry in A.o is and how we interface with him or her during this change process.

* The DNS IP reassignment can take 24 hrs or more to ripple around the worldwide hierarchy of DNS servers. During this period who goes to which service is undefined.

* There are many ways "to skin the cat" of the migration process. All will involve some service loss, but the complexity of the rehearsal and planning can explode as we reduce this outage to zero. Complex plans can also go wrong so my instinct is to keep it simple: halt the service at a pre-notified time, transfer and start the new service at a pre-notified time.

* *DECISION*: Halt the forum service for a notified (24hr) window during cutover. The migration uses fixed IPs, so DNS IP reassignment is co-incident with service stop.

* *GOAL*: Cut over forums within 7 days from today. Date TBD by PM. I can do the content move.

* *DECISION*: Halt the wiki service for a notified (24hr) window during cutover. The migration uses fixed IPs, so DNS IP reassignment is co-incident with service stop.

* *GOAL*: Cut over wiki within 14 days from today. Date TBD by PM. I can do the content move.

* We have some further caching tweaks on the interaction of the MediaWiki application with the ATS HTTP reverse proxy cache, but these are probably more nice-to-have than essential. More to the point these need to be done on a system with production load patterns.


* *DECISION*: We will defer such tuning until post go-live.
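
The relationship between the DNS ripple time and the 24hr halt window noted above can be sketched as follows. This is an illustration under stated assumptions: it assumes resolvers honour the TTL of the old record (not all do, which is one reason the ripple can take longer), and the function names, the TTL value and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def latest_stale_answer(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """Latest time a TTL-honouring resolver could still serve the old IP.

    A resolver that cached the old record just before the change
    may keep it for one full TTL.
    """
    return change_time + timedelta(seconds=old_ttl_seconds)


def covered_by_outage(change_time: datetime,
                      old_ttl_seconds: int,
                      outage_end: datetime) -> bool:
    """True if every cached old answer expires before service resumes."""
    return latest_stale_answer(change_time, old_ttl_seconds) <= outage_end


if __name__ == "__main__":
    change = datetime(2011, 9, 1, 0, 0)       # hypothetical cut-over start
    resume = change + timedelta(hours=24)     # end of the notified window
    print(covered_by_outage(change, 86400, resume))  # 24h TTL on old record
    print(covered_by_outage(change, 172800, resume)) # 48h TTL on old record
```

Under these assumptions, lowering the TTL on the old records ahead of the change shortens the period in which users can be sent to the legacy address.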


*OTHER ISSUES*

* I am pretty much maxed out on "high-priority" tasks at the moment, so I can't accept any more tasks until I've made material progress on my current committed list.

* A number of us have serious concerns about the continuity issues around the [email protected] forwarding service. I feel that there is a project consensus that this service needs to be relocated to Apache.org infrastructure until we can sentence its content. The fact that we don't have an owner doing as I am with the wiki and forum services is a GRAVE CONCERN. I was hoping to step up to do this, but with the decisions to upgrade phpBB and MW, the previous point now applies.

If you have got to the bottom of this, then thanks for your patience and time.


Regards
Terry

PS. Please remember to break off specific discussion points onto separate threads.
