Re: [lopsa-discuss] Weekly oncall rotations: Which day do you start on?

Elijah Wright Sun, 04 Mar 2012 18:30:44 -0800

On Sun, Mar 4, 2012 at 7:24 PM, Luke S. Crawford <[email protected]> wrote:
> On Fri, Mar 02, 2012 at 08:51:03AM -0600, Elijah Wright wrote:
>> We use Friday at noon to Monday noon for weekend coverage.
>>
>> We stopped using weeklong oncall shifts last year and went to daily
>> rotation (we use PagerDuty to organize this).  It's far less draining, and
>> it encouraged us to get better about ticketing issues to smooth handoff.
>
> Oh man, that sounds pretty great.  How did it work out?


It's qualitatively better.  Rather than losing productivity for a
solid week, and then losing the NEXT week as you work off your sleep
deficit (if you believe in such things...), you lose a day or so at a
time, and then occasionally a weekend of heightened awareness of what
the state of things really is.

Steppin' up one's game about monitoring knocks back the number of
alerts, too - find all the broke crap, fix it, and get fewer alerts.
:-)

Some folks were discussing MTBF on chassis Friday - does anybody know
if there are actual MTBF numbers for chassis failures on Dell gear
(that are not simply made up...)?  We mostly see "lots and lots of
drive failures" on RAID arrays, and those are pretty easily
actionable - the number of total chassis failures is so low that I
have trouble tracking it.  ;)


> The real solution would be to make hardware failures not a big deal
> at the application level (and this wasn't hosting, this was an application
> where that sort of thing could have been done)  or to automate the
> tools the SysAdmins used to deal with the problems by hand, (and
> this could also be done;  we were the *NIX people, not the hardware
> people.  You could automate what we did without, you know, robots.)
> but in the year I was there, it never happened.  Maybe we were all too
> burnt out from our time on duty?  I don't know.  I know I personally
> feel some shame that I never put a technical solution to that problem
> in place.


At one place I worked a few years ago, we got within striking
distance of having our Nagios install able to open, close, escalate,
and self-heal a variety of very common issues.  It's *great* when you
get to that point - it makes your life not suck.
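
For anyone who hasn't wired this up: the self-heal part usually hangs
off a Nagios event handler, which Nagios runs on service state changes
and which can be passed macros such as $SERVICESTATE$,
$SERVICESTATETYPE$ and $SERVICEATTEMPT$ as arguments.  Below is a rough
Python sketch of the decision logic only.  The decide_action name, the
action strings, and the "restart on the third soft failure" threshold
are illustrative assumptions, not what we actually ran.

```python
def decide_action(state: str, state_type: str, attempt: int,
                  restart_on_attempt: int = 3) -> str:
    """Pick a response to one Nagios service state change.

    The arguments correspond to the $SERVICESTATE$, $SERVICESTATETYPE$
    and $SERVICEATTEMPT$ macros. The returned action names are
    hypothetical labels for whatever tooling sits behind the handler.
    """
    if state == "OK":
        # Service recovered: close any ticket opened earlier
        # (assumes closing is harmless when no ticket exists).
        return "close_ticket"
    if state != "CRITICAL":
        # WARNING / UNKNOWN: leave these for a human to look at.
        return "none"
    if state_type == "SOFT":
        # Still retrying: try one automatic restart on the chosen
        # attempt, otherwise wait for the next check.
        return "restart_service" if attempt == restart_on_attempt else "none"
    # HARD CRITICAL: retries are exhausted, so hand it to a person.
    return "open_ticket_and_escalate"
```

The point of restarting during the SOFT state is that a successful
restart lets the service recover before it ever goes HARD, so nobody
gets paged for it.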


>> Weekend rotations are just an override atop our daily schedule in PD.
>> Someone updates it every month or so to keep the weekends planned out a
>> couple/three months in advance.
>
> How do you feel about PagerDuty vs. just hooking Nagios directly to
> Twilio or something?   Hooking Nagios to Twilio was on my todo list,
> but if PagerDuty is significantly easier and/or more featureful,
> eh, it's certainly in the reasonable price range for something that
> solves a problem near the top of my todo list.

Effectively, it's the same thing.  Just shunt your Nagios alert
messages into the email inbox that you set up in PagerDuty.  A more
complicated setup can be e.g. "Pingdom alerts go to this address",
"drive failures go to this one", and "system reboots get sent here",
so that you can sort them by alerting needs and the like. [If one had
NOC staff who were responsible for drive swaps, for instance, it would
make a ton of sense to just send those to the right folks...]
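
To make the sorting concrete, here's a rough Python sketch of that kind
of routing.  The keywords and addresses are made up for illustration
(the real inbox addresses come from whatever you set up in PagerDuty),
so treat it as a sketch rather than a config.

```python
# Hypothetical routing table: (keyword in the alert subject, inbox).
# First match wins. The addresses are placeholders, not real
# PagerDuty integration addresses.
ROUTES = [
    ("pingdom", "pingdom-alerts@example.com"),
    ("drive", "drive-failures@example.com"),
    ("reboot", "system-reboots@example.com"),
]
DEFAULT_INBOX = "oncall@example.com"


def route_alert(subject: str) -> str:
    """Return the inbox an alert should be mailed to, by subject keyword."""
    lowered = subject.lower()
    for keyword, inbox in ROUTES:
        if keyword in lowered:
            return inbox
    # Anything unrecognized goes to the general on-call inbox.
    return DEFAULT_INBOX
```

With a table like this, a subject such as "PROBLEM: drive failure on
raid01" lands in the drive-failures inbox, and anything unmatched falls
through to the general on-call one.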


> One thing I want is a phone call that won't stop calling until I
> press a number to ack it, or something.

That's PagerDuty.  I think you'd like it.  [When it devolves to me
doing free PR for PagerDuty - color me impressed with their service.]

I think I've seen it 'lag' only one time - the rest of the alerting
has been incredibly timely and on-the-ball.  [I imagine that they'd
love to know that I saw a funny blip, but I never bothered to tell
them....]

best,

--e
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
