On Fri, Mar 02, 2012 at 08:51:03AM -0600, Elijah Wright wrote:
> We use Friday at noon to Monday noon for weekend coverage.
>
> We stopped using weeklong oncall shifts last year and went to daily
> rotation (we use pagerduty to organize this). It's far less draining, and
> it encouraged us to get better about ticketing issues to smooth handoff.
Oh man, that sounds pretty great. How did it work out?

I remember when I got a job managing the largest fleet I had managed until then (and have managed since). During the interview, they mentioned they had 12 hour, 7 day pager cycles: for one week, you were on from 6:00 to 18:00 local time, and another team on the other side of the planet covered the rest of the time. I was excited. Oh man, 12 hours, this is going to be easy! I mean, I had been on call 24x7x365 for most of my career, but mostly at smaller places with tens or hundreds of servers. On a bad week, I'd get woken up twice by the pager, then I'd put some effort into making things more robust and it would quiet down. From that point of view, 12 hours sounded easy.

The problem was that this place had tens of thousands of servers, and the automation to handle failed servers was non-existent. You'd get paged with a really urgent problem every 30 minutes, because with that much hardware you get a lot of failures, and the system design was such that a single down server was an urgent outage. So it ended up being a 90 hour, extremely high stress week, then you'd get a month to recover, then it'd happen again.

I always thought that switching off daily (or even every 8 hours for something this stressful) would have made more sense, but they didn't try that until I quit to go back to hourly contracting (without a pager; partly for more money to spend on prgmr.com, but mostly to escape the 90 hour weeks).

The real solution would have been to make hardware failures not a big deal at the application level (this wasn't hosting; this was an application where that sort of thing could have been done), or to automate what the SysAdmins were doing by hand to deal with the problems (this could also have been done; we were the *NIX people, not the hardware people. You could automate what we did without, you know, robots). But in the year I was there, it never happened. Maybe we were all too burnt out from our time on duty?
I don't know. I know I personally feel some shame that I never put a technical solution to that problem in place.

> Weekend rotations are just an override atop our daily schedule in PD.
> Someone updates it every month or so to keep the weekends planned out a
> couple/three months in advance.

How do you feel about PagerDuty vs. just hooking Nagios directly to Twilio or something? Hooking Nagios to Twilio was on my todo list, but if PagerDuty is significantly easier and/or more featureful, eh, it's certainly in the reasonable price range for something that solves a problem near the top of my todo list.

One thing I want is a phone call that won't stop calling until I press a number to ack it, or something.

_______________________________________________
Discuss mailing list
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
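For what it's worth, the "keep calling until I press a number" behavior mentioned above can be sketched roughly like this. It is only an illustration under assumptions: `place_call` and `is_acked` are hypothetical hooks that you would have to implement yourself against whatever telephony service you use (nothing here is Twilio's, PagerDuty's, or Nagios's actual API), and the retry interval and attempt limit are made-up defaults.

```python
import time
from typing import Callable


def call_until_acked(
    place_call: Callable[[], None],   # hypothetical hook: dial the on-call phone
    is_acked: Callable[[], bool],     # hypothetical hook: True once a key was pressed
    retry_seconds: float = 60.0,      # assumed wait between calls
    max_attempts: int = 30,           # assumed limit before giving up
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Place calls until one is acknowledged; return how many calls were placed.

    Raises RuntimeError if nobody acks within max_attempts, so the caller
    can escalate to someone else instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        place_call()
        # Give the person time to pick up and press a key before checking.
        sleep(retry_seconds)
        if is_acked():
            return attempt
    raise RuntimeError(f"no ack after {max_attempts} calls")
```

Passing `sleep` in as a parameter is just a convenience so the loop can be exercised without real waiting; a Nagios notification command could invoke a small script wrapped around this function, with the two hooks backed by the provider's call and keypress-callback features.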
