On Sun, Mar 4, 2012 at 7:24 PM, Luke S. Crawford <[email protected]> wrote:
> On Fri, Mar 02, 2012 at 08:51:03AM -0600, Elijah Wright wrote:
>> We use Friday at noon to Monday noon for weekend coverage.
>>
>> We stopped using weeklong oncall shifts last year and went to daily
>> rotation (we use pagerduty to organize this). It's far less draining, and
>> it encouraged us to get better about ticketing issues to smooth handoff.
>
> Oh man, that sounds pretty great. How did it work out?

It's qualitatively better. Rather than losing productivity for a solid
week, and then losing the NEXT week as you work off your sleep deficit
(if you believe in such things...), you lose a day or so at a time, and
then occasionally a weekend of heightened awareness of what the state of
things really is.

Steppin' up one's game about monitoring knocks back the number of
alerts, too - find all the broke crap, fix it, and get fewer alerts. :-)

Some folks were discussing MTBF on chassis Friday - does anybody know if
there are actual MTBF numbers for chassis failures on Dell gear (that
are not simply made up...)? We mostly see "lots and lots of drive
failures" on RAID arrays, and those are pretty easily actionable - the
number of total chassis failures is so low that I have trouble tracking
it. ;)

> The real solution would be to make hardware failures not a big deal
> at the application level (and this wasn't hosting, this was an application
> where that sort of thing could have been done) or to automate the
> tools the SysAdmins used to deal with the problems by hand, (and
> this could also be done; we were the *NIX people, not the hardware
> people. You could automate what we did without, you know, robots.)
> but in the year I was there, it never happened. Maybe we were all too
> burnt out from our time on duty? I don't know. I know I personally
> feel some shame that I never put a technical solution to that problem
> in place.

One place that I worked a few years ago, we got within striking distance
of having our Nagios install able to open/close/escalate/self-heal a
variety of very common issues. It's *great* when you get to that point -
it makes your life not suck.

>> Weekend rotations are just an override atop our daily schedule in PD.
>> Someone updates it every month or so to keep the weekends planned out a
>> couple/three months in advance.
>
> How do you feel about Pagerduty vs. just hooking nagios directly to
> twillio or something?
> Hooking nagios to twillio was on my todo list,
> but if pagerduty is significantly easier and/or more featureful,
> eh, it's certainly in the reasonable price range for something that
> solves a problem near the top of my todo list.

Effectively, it's the same thing. Just shunt your Nagios alert messages
into the email inbox that you set up in PagerDuty. A more complicated
setup can be, e.g., "pingdom alerts go to this address", "drive failures
go to this one", and "system reboots get sent here" - so that you can
sort them by alerting needs and the like. [If one had NOC staff who were
responsible for drive swaps, for instance, it would make a ton of sense
to just send those to the right folks...]

> One thing I want is a phone call that won't stop calling until I
> press a number to ack it, or something.

That's PagerDuty. I think you'd like it.

[When it devolves to me doing free PR for PagerDuty - color me impressed
with their service.] I think I've seen it 'lag' only one time - the
rest of the alerting has been incredibly timely and on-the-ball. [I
imagine that they'd love to know that I saw a funny blip, but I never
bothered to tell them....]

best,

--e

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
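
For anyone who wants to try the per-category routing described above
("pingdom alerts go to this address", "drive failures go to this one"),
here is a rough Python sketch of the idea. Everything named in it is an
assumption for illustration: the category names, the example.com
addresses (stand-ins for whatever email inboxes you create in PagerDuty
or for a NOC mailbox), and the helper names route_alert and build_alert.
It only builds the message; handing it to your MTA is left out.

```python
from email.message import EmailMessage

# Hypothetical addresses: each stands in for an email inbox set up in
# PagerDuty (or a NOC mailbox). None of these come from the thread.
ROUTES = {
    "pingdom": "pingdom-alerts@example.com",
    "drive_failure": "noc-drive-swaps@example.com",
    "reboot": "reboot-alerts@example.com",
}
DEFAULT_ADDRESS = "oncall@example.com"  # hypothetical catch-all inbox


def route_alert(category: str) -> str:
    """Return the inbox for an alert category, or the catch-all inbox."""
    return ROUTES.get(category, DEFAULT_ADDRESS)


def build_alert(category: str, host: str, detail: str) -> EmailMessage:
    """Build (but do not send) the email a monitoring hook would emit."""
    msg = EmailMessage()
    msg["From"] = "nagios@example.com"  # hypothetical sender
    msg["To"] = route_alert(category)
    msg["Subject"] = f"[{category}] {host}"
    msg.set_content(detail)
    return msg


if __name__ == "__main__":
    alert = build_alert("drive_failure", "db01", "RAID member failed")
    print(alert["To"])
```

Keeping the category-to-address table in one place means that sending
drive swaps to different folks later is a one-line change.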
