On Fri, Mar 02, 2012 at 08:51:03AM -0600, Elijah Wright wrote:
> We use Friday at noon to Monday noon for weekend coverage.
> 
> We stopped using weeklong oncall shifts last year and went to daily
> rotation (we use pagerduty to organize this).   It's far less draining, and
> it encouraged us to get better about ticketing issues to smooth handoff.

Oh man, that sounds pretty great.  How did it work out?

I remember when I got a job managing the largest fleet I had managed until
then (and have managed since).  During the interview, they mentioned they
had 12 hour, 7 day pager cycles (for one week, you were on from 6:00 to 18:00
local time; we had another team on the other side of the planet for the
rest of the time).  I was excited.  Oh man, 12 hours, this is going to
be easy!  I mean, I had been on call 24x7x365 for most of my career, but
mostly at smaller places with tens or hundreds of servers.  On a bad week,
I'd get woken up twice by pager, and I'd put some effort into making things
more robust and it would quiet down.  From that point of view, 12 hours
sounded easy.

The problem was that this place had tens of thousands of servers, and
the automation to handle failed servers was nonexistent.  You'd get paged
with a really urgent problem every 30 minutes, because with that much
hardware, you get a lot of failures, and the system design was such that
a single down server was an urgent outage.

So it ended up being a 90 hour extremely high stress week, then you'd get 
a month to recover, then it'd happen again.   I always thought that 
switching off daily (or even every 8 hours for something this stressful) 
would have made more sense, but they didn't try that until I quit
to go back to hourly contracting (without a pager, partly for more money
to spend on prgmr.com, but mostly to escape the 90 hour weeks.) 

The real solution would have been to make hardware failures not a big deal
at the application level (this wasn't hosting; it was an application
where that sort of thing could have been done), or to automate the
work the SysAdmins did by hand to deal with the problems (this could
also have been done; we were the *NIX people, not the hardware
people.  You could automate what we did without, you know, robots).
But in the year I was there, it never happened.  Maybe we were all too
burnt out from our time on duty?  I don't know.  I know I personally 
feel some shame that I never put a technical solution to that problem 
in place.  

> Weekend rotations are just an override atop our daily schedule in PD.
> Someone updates it every month or so to keep the weekends planned out a
> couple/three months in advance.
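
If I'm reading that right, the schedule is roughly the sketch below: a
daily rotation, with a hand-planned override covering Friday noon to
Monday noon.  This is only my guess at the logic, not how PagerDuty
does it internally.  The names (weekend_start, on_call), the midnight
daily handoff, and keying overrides by the Friday-noon start time are
all my assumptions.

```python
from datetime import datetime, timedelta

def weekend_start(now):
    """Return the Friday-noon start of the weekend window containing
    `now`, or None if `now` falls outside Friday noon to Monday noon."""
    # weekday(): Monday=0 ... Friday=4, Saturday=5, Sunday=6
    days_since_friday = (now.weekday() - 4) % 7
    start = (now - timedelta(days=days_since_friday)).replace(
        hour=12, minute=0, second=0, microsecond=0)
    if start <= now < start + timedelta(days=3):
        return start
    return None

def on_call(now, daily, overrides):
    """Pick who is on call at `now`.

    daily     -- list of names, rotated once per calendar day
                 (midnight handoff is an assumption)
    overrides -- dict mapping a Friday-noon datetime to a name,
                 maintained by hand a few months ahead
    """
    start = weekend_start(now)
    if start is not None and start in overrides:
        return overrides[start]
    # No override planned (or it's a weekday): fall back to the daily rotation.
    return daily[now.date().toordinal() % len(daily)]
```

For example, with overrides = {datetime(2012, 3, 2, 12, 0): "alice"},
any time from Friday the 2nd at noon up to Monday the 5th at noon goes
to alice, and everything else falls through to the daily list.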


How do you feel about PagerDuty vs. just hooking Nagios directly to
Twilio or something?   Hooking Nagios to Twilio was on my todo list,
but if PagerDuty is significantly easier and/or more featureful,
eh, it's certainly in the reasonable price range for something that
solves a problem near the top of my todo list.

One thing I want is a phone call that won't stop calling until I
press a number to ack it, or something.  
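
Roughly, the behavior I'm after is the loop below.  It's a sketch, not
a working integration: place_call and is_acked are hypothetical
callables standing in for whatever actually dials the phone and
records the keypress, and the retry interval and max_calls cap are
made-up parameters.

```python
import time

def page_until_acked(place_call, is_acked, retry_seconds=60,
                     max_calls=None, sleep=time.sleep):
    """Keep calling until the page is acknowledged.

    place_call -- hypothetical callable that dials the on-call phone
    is_acked   -- hypothetical callable returning True once someone
                  has pressed a key to acknowledge
    max_calls  -- optional cap; None means keep calling forever
    Returns True if acknowledged, False if the cap was hit first.
    """
    calls = 0
    while not is_acked():
        if max_calls is not None and calls >= max_calls:
            return False  # give up; a caller could escalate to someone else here
        place_call()
        calls += 1
        sleep(retry_seconds)  # give the human a moment to pick up and ack
    return True
```

Passing sleep in as a parameter is only there so the loop can be
exercised without actually waiting a minute between calls.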
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/