On Wed, Dec 31, 2014 at 10:48 AM, Bryan Davis <[email protected]> wrote:
> I think this will mess some things up in beta (file uploads, debug
> logging). If having a partial outage of beta from just after a group0
> deploy to just before a new branch release sounds bad to any of you,
> you may want to respond to Coren's thread on labs-l and suggest a
> better time (Thurs-Fri?).
>

I think Thu/Fri would definitely be preferable. People do a ton of
pre-deploy checking on beta labs early in the week. VE for sure, and
probably MobileFrontend, would be particularly affected.

> Bryan
>
> ---------- Forwarded message ----------
> From: Marc A. Pelletier <[email protected]>
> Date: Wed, Dec 31, 2014 at 10:11 AM
> Subject: [Labs-l] Filesystem downtime to schedule
> To: Wikimedia Labs <[email protected]>
>
> Hello Labs,
>
> Many of you may recall that until some point in late 2013, one of the
> features of the Labs file server was that it provided time-travel
> snapshots (you could see a consistent view of the filesystem as it
> existed 1h, 2h, 3h, 1d, 2d, 3d and 1 week ago).
>
> This was disabled at that time - despite being generally considered
> valuable - because it was suspected to be (part of) the stability
> problems the NFS server suffered at the time. This turns out not to
> have been the case, and we could turn it back on now.
>
> Indeed, doing so is a prerequisite to the planned replication of the
> filesystem in the new datacenter where a redundant Labs installation
> is slated to be deployed[1].
>
> The issue is that turning that feature back on requires changing the
> way the disk space is currently allocated at a low level[2] and
> necessitates a fairly long period of partial downtime during which
> data is copied from one part of the disk subsystem to the other.
> In practice, this would require the primary partitions (/home and
> /data/project) to be set read-only for a period on the order of a
> day (24-30 hours).
>
> That downtime is pretty much unavoidable eventually, as it is a
> requirement of expanding Labs and improving data resilience and
> reliability, but the /timing/ of it is flexible. I wanted to "poll"
> Labs users as to when the possibility of disruption is minimized, and
> give everyone plenty of time to make contingency plans and/or notify
> their end users of the expected period of reduced availability.
>
> Provided there is a good consensus that the week is a better time
> than the weekend (I am guessing here that volunteer coders and users
> are more active during the weekend), I would suggest starting the
> operation on Tuesday, January 13 at 18:00 UTC. The downtime is
> expected to last until January 14, 18:00 UTC, but may extend a few
> hours beyond that.
>
> The expected impacts are:
>
> * Starting at the beginning of the window, /home and /data/project
> will switch to read-only mode; any attempt to write files to those
> trees will result in EROFS errors being thrown. Reading from those
> filesystems will still work as expected, as will writing to other
> filesystems;
> * Read performance may degrade noticeably as the disk subsystem will
> be loaded to capacity;
> * It will not be possible to manipulate the gridengine queue -
> specifically, starting or stopping jobs will not work; and
> * At the end of the window, when the operation is complete, the "old"
> filesystem will go away and be replaced by the new one - this will
> cause any access to files or directories that were previously opened
> (including working directories) on the affected filesystems to error
> out with ESTALE. Reopening files by name will access the new copy,
> identical to the one at the time the filesystems became read-only.
>
> In practice, that latter impact means that most running programs will
> be unable to continue unless they have special handling for this
> situation, and most gridengine jobs will no longer be able to log
> output.
> It may be a good idea to restart any continuous tool at that time.
> All webservices that were running at the start of the maintenance
> window will be restarted at that time.
>
> If you have tools or other processes running that do not rely on
> being able to write to /data/project, they may be able to continue
> running during the downtime without interruption. Jobs that only
> access the network (for instance, the MediaWiki API) or the databases
> are not likely to be affected. Because of this, no automatic or
> forcible restart of running (non-webservice) jobs will be made.
>
> In particular, if you have a tool whose continued operation is
> important, temporarily modifying it so that it works from
> /data/scratch may be a good workaround.
>
> Finally, in order to avoid the risk of the filesystem move taking
> longer than expected and increasing downtime significantly, LOG FILES
> OVER 1G WILL NOT BE COPIED. If you have critical files that are not
> simple log files but whose names end in .log, .err or .out, then you
> MUST compress those files if you absolutely require them to survive
> the transition. Alternately, truncating them to some size comfortably
> smaller than 1G will work if the file must remain uncompressed.
>
> The speed and reliability of the maintenance process depends on the
> total amount of data to copy. If you can clean up both your home and
> project directories of extraneous files, you'll help the process
> greatly. :-)
>
> Thanks all,
>
> -- Marc
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
> --
> Bryan Davis              Wikimedia Foundation    <[email protected]>
> [[m:User:BDavis_(WMF)]]  Sr Software Engineer    Boise, ID USA
> irc: bd808                                       v:415.839.6885 x6855
>
> _______________________________________________
> QA mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/qa
>
_______________________________________________
QA mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/qa
