Sorry in advance. I've tried to put this type of content on the cwiki and it has largely been ignored by the DL, hence this email. It is a long one, so I have tried to structure it and keep to simple plain text (as the DL forwarder seems to screw up text/html markup).

Since this is so long, any reply threads could kill us. So can I ask respondents to open up a new [migration] thread to discuss specific points in depth, rather than replying directly to this?

*BACKGROUND*

As you all know, I've been doing work on the forums and wiki, though the wiki has been my main focus recently. I am acting in two roles here: (i) the lead SysAdmin for the ooo-wiki and ooo-forum VMs; (ii) the lead (and currently only) application maintainer on both systems. I also have ssh access to the current prod systems running in Oracle infrastructure and perform equivalent roles there. So in practice, I am doing all of this related work, including liaising with the project, the infrastructure team and with Andrew Rist who represents Oracle here.

I understandably have to work within the practices of the Apache infrastructure team to retain my permit to act as SysAdmin on these two VMs. I am also trying to meet the expectations of the project, the infrastructure teams and the needs of our user population across all of this work.

We seem to have a Catch-22 here, and this email is about how we break this and move these aspects of the project forward. My interpretation of this Catch-22 is that whilst our current interactions on the DL are a good basis for individuals articulating views on a particular thread (and some seem to generate hundreds of viewpoints) we have no functioning mechanism to move to, and adopt some form of, a consensus policy or decision. The exact project requirements for this wiki, these forums and the [email protected] mail forwarders are cases in point. However, the infrastructure team believe that we, the project, have an urgency about making this cut over from Oracle to Apache infrastructure, and are pressing me to make progress.

I can't execute any plan without a baseline requirement and set of assumptions, so what this note attempts is to lay down such a set, and the decisions that need to be made to go forward. So PLEASE, I don't want any flames about my use of DECISION below. What I simply mean is that if the PPMC as a body accepts these, then I will try my best to move this work forward. Of course you are free to challenge / change any of this if that is a PPMC voted decision, but in this case I need to move into a different mode: to suspend work and stop the clock until we have a PPMC-endorsed baseline to replan on. I am NOT going to press on without broad endorsement and then be criticised in retrospect for doing so.

*INFRASTRUCTURE DRIVERS*

The infrastructure team has a policy of bringing in new services at current S/W versions whenever possible -- simply because it makes it easier to support them, and doing this before the service is on-line involves less work and risk than when it is in production. I understand and agree with this goal even though it can front-load work.

* The infrastructure stack is based on a standard Ubuntu server LAMP stack as at current LTS (Ubuntu 10.04-3 LTS), which includes PHP 5.3.2.

* The forums are stable, but at an N-1 release level. (phpBB 3.0.8 vs. 3.0.9).

* *DECISION*: Upgrade the ooo-forums phpBB app + customisations to 3.0.9 before go-live. (Based on my last 5 upgrades, this is 1-2 days' work, the main part being the regression of a 1K line customisation patch when we rebaseline the package from 3.0.8 -> 3.0.9.)

* The prod wiki is at v1.15.1, which is at an N-3 major release level (that's 30 months old: two major and 10 minor revisions behind the current supported release). This also runs on PHP 5.2.0.

* We need a reverse-proxy HTTP cache for performance reasons on the wiki. One of the four market leaders in this niche is another Apache project: Apache Traffic Server (ATS). It makes sense to stay "in-house" here for both support and referenceability reasons.

* *DECISION*: Adopt ATS v3.0.1 as the HTTP cache for the wiki. (BTW, this work has been done and the product is excellent).

PHP 5.3 introduced extra checking to remove an area of tolerance that PHP 5.x (x<3) allowed. This was to do with when and how parameters can be passed by reference under certain circumstances. So moving a code base from 5.2 to 5.3 involves a lot of work identifying and eliminating these mis-codings. This was done by the MediaWiki team in MW v1.16.

I had planned to move to MW v1.15.5 (the last stable 1.15.x) as our baseline and I've done this work, integrating it with Apache Traffic Server (ATS) and our LAMP stack. This is stable and performant enough to show that we are good. However, I have only identified and bug-fixed the main path 5.2->5.3 coding issues. During my testing I have subsequently discovered others and there are undoubtedly more to find. I've also discussed this with the MW devs on the MW IRC channel.

Given this, the consensus in the @infra team (me included) is that we should bite the bullet now and move to current MW 1.17.0, even given the extra work. There are some performance risks associated with MW 1.17.0 which we need to mitigate. However, given that we've already got a complete LAMP+ATS+MW in an ESX hosted VM performing like a dingbat, we really only face the 1.15.5 -> 1.17.0 issues in this step.
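To give a feel for what hunting down these mis-codings involves, here is a rough sketch of a scanner for one related pattern: call-time pass-by-reference such as hook(&$page), which PHP 5.3 reports as deprecated. This is an illustration only. It is an assumption that this pattern overlaps with the issues found in the MW 1.15.x code base; the function name find_call_time_refs and the regular expression are made up for the example, and the heuristic will miss nested calls and multi-line argument lists.

```python
import re

# Heuristic: an identifier, "(", then an argument that starts with "&$",
# either as the first argument or directly after a comma.
CALL_WITH_REF = re.compile(
    r"\b([A-Za-z_]\w*)\s*\((?:[^()]*,)?\s*&\s*\$\w+[^()]*\)"
)

# Constructs where "&$var" inside parentheses is legal and must be ignored.
SKIP_NAMES = {"array", "list", "function", "use"}


def find_call_time_refs(php_source):
    """Return (line_number, line_text) pairs that look like call-time
    pass-by-reference. Hypothetical helper, heuristic only."""
    hits = []
    for number, line in enumerate(php_source.splitlines(), start=1):
        for match in CALL_WITH_REF.finditer(line):
            name = match.group(1)
            prefix = line[:match.start()]
            # "function foo(&$x)" is a declaration, which is legal.
            if re.search(r"\bfunction\s*&?\s*$", prefix):
                continue
            if name.lower() in SKIP_NAMES:
                continue
            hits.append((number, line.strip()))
            break  # one report per line is enough
    return hits
```

Running it over each .php file in the core and the extensions would produce a list of lines to review by hand. It is a way to find candidates, not a substitute for testing.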

* *DECISION*: Upgrade the ooo-wiki MediaWiki (MW) + all extensions to MW v1.17.0.

* *DECISION*: I have agreed with infrastructure that we will keep 1 core on "standby" so we can up the VM to a 2-core VM if we are seeing unacceptable performance problems with one core.

*BRANDING AND OTHER CONTENT/ACCESS CHANGES*

I've asked for feedback and "doer" support on the content aspect of the wiki and the forum. There have been hundreds of associated emails and unstructured discussion but no hard decisions taken. Drew and Dave F have offered to get involved here, but we need to set up accounts etc., and move into execution.

* *DECISION*: We will cut over the wiki and the forums with the content as-is and implement branding and access control changes within the a.o infrastructure when volunteers come on-stream to resource this. This is the standard "transfer then clean-up" approach adopted when a migration is time critical.

*PRIORITISATION*

On the one hand the forums involve a lot less work and technical risk. On the other they are arguably also used more than the wiki at the moment, and the post rates are a LOT higher. Because I am a single, constrained resource, I can't do both at the same time. This one is toss-a-coin, but my instinct is to get the one that we can do quickly done. But if there is a strong consensus to swap then I can do it.

* *DECISION*: The priority is to work the forums over the wiki. We cut the forums over first.

*CUT-OVER*

There are two facets to cut-over: content move and DNS-based IP reassignment. Clearly we need to freeze update access to the services prior to the start of the content move and continue the update-freeze on the legacy service. Bringing the content across involves a backup, copy and restore, which can be rehearsed and scripted, but in the case of the wiki this will take a few hours even if fully automated.
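As a sketch of one step that is easy to script and rehearse, the copy of a backup file can be checked with a checksum before the restore starts. This is an illustration under assumptions: the function names are hypothetical, and a local file copy stands in for whatever host-to-host transfer is actually used.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_path, target_path):
    """Copy a backup file and confirm the copy is byte-identical.

    The local copy here is a stand-in (assumption) for the real
    transfer between the legacy and new hosts.
    """
    expected = sha256_of(source_path)
    shutil.copyfile(source_path, target_path)
    actual = sha256_of(target_path)
    if actual != expected:
        raise RuntimeError("checksum mismatch after copy: %s" % target_path)
    return actual
```

Failing loudly on a mismatch means a rehearsal run catches a truncated or corrupted dump before anyone restores from it.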

The main issues are:

* Migration coordination is more of a Programme Management / Coordination challenge than a purely technical one. In the old (paid/corporate) days, I would have a Programme Manager (PM)-type working alongside me covering this aspect.

* *REQUEST* Would anyone who has previous experience of doing this like to volunteer to take this role, so I can focus on the technical stuff?

* We have to transfer DNS control for oo.o to Apache even if the A and MX records point to the @Oracle IP addresses.

* *ACTION*: Our PM needs to identify who the authoritative controller for the DNS entry in A.o is and how we interface with him or her during this change process.

* The DNS IP reassignment can take 24 hrs or more to ripple around the worldwide hierarchy of DNS servers. During this period who goes to which service is undefined.

* There are many ways "to skin the cat" of the migration process. All will involve some service loss, but the complexity of the rehearsal and planning can explode as we reduce this outage to zero. Complex plans can also go wrong, so my instinct is to keep it simple: halt the service at a pre-notified time, transfer and start the new service at a pre-notified time.

* *DECISION*: Halt the forum service for a notified (24hr) window during cutover. The migration uses fixed IPs, so DNS IP reassignment is co-incident with service stop.

* *GOAL*: Cut over forums within 7 days from today. Date TBD by PM. I can do the content move.

* *DECISION*: Halt the wiki service for a notified (24hr) window during cutover. The migration uses fixed IPs, so DNS IP reassignment is co-incident with service stop.

* *GOAL*: Cut over wiki within 14 days from today. Date TBD by PM. I can do the content move.

* We have some further caching tweaks on the interaction of the MediaWiki application with the ATS HTTP reverse proxy cache, but these are probably more nice-to-have than essential. More to the point, these need to be done on a system with production load patterns.

* *DECISION*: We will defer such tuning until post go-live.
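During the DNS ripple period mentioned above, a small check like the following could tell a volunteer which service their own resolver currently points at. It is a sketch only: the function name which_service is hypothetical, the hostname and addresses in the usage note are placeholders, and it only reports the view from the machine it runs on.

```python
import socket


def which_service(hostname, legacy_ip, new_ip, resolve=socket.gethostbyname):
    """Report which service this machine's resolver currently points at.

    The resolve parameter exists so the check can be rehearsed with a
    fake resolver; by default the system resolver is used.
    """
    try:
        address = resolve(hostname)
    except OSError:
        # Lookup failures from the socket module are OSError subclasses.
        return "unresolved"
    if address == new_ip:
        return "new"
    if address == legacy_ip:
        return "legacy"
    return "unknown (%s)" % address
```

For example, which_service("wiki.example.org", "192.0.2.10", "198.51.100.20") would return "legacy" until the local resolver picks up the new A record. The name and both addresses are placeholders, not the real ones.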


*OTHER ISSUES*

* I am pretty much maxed out on "high-priority" tasks at the moment, so I can't accept any more tasks until I've made material progress on my current committed list

* A number of us have serious concerns about the continuity issues around the [email protected] forwarding service. I feel that there is a project consensus that this service needs to be relocated to Apache.org infrastructure until we can sentence its content. The fact that we don't have an owner doing for it what I am doing for the wiki and forum services is a GRAVE CONCERN. I was hoping to step up to do this, but with the decisions to upgrade phpBB and MW, the previous point now applies.

If you have got to the bottom of this, then thanks for your patience and time.

Regards
Terry

PS. Please remember to break off specific discussion points onto separate threads.

