Re: [lopsa-discuss] Pager rotation for nagios

Sean Lally Tue, 26 Aug 2014 07:20:09 -0700

You could also use the flap detection and dependencies in nagios to help
with alert floods.
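
For the home-grown route mentioned in the quoted messages below (mail aliases that a cron job repoints every Monday), here is a rough sketch of the rotation logic in Python. Everything in it is an assumption for illustration, not anybody's actual script: the function names, the team and member names, the alias naming scheme, and the choice of Monday 2014-01-06 as the starting week. It only builds the alias lines; writing them into the aliases file and running newaliases is left out.

```python
from datetime import date

# Hypothetical starting point for the rotation: a Monday, so weeks roll over on Mondays.
ROTATION_EPOCH = date(2014, 1, 6)


def on_call_pair(members, today, epoch=ROTATION_EPOCH):
    """Return (primary, secondary) for the week that contains `today`.

    The primary slot advances one position per week; the secondary is
    simply the next person in the list, wrapping around at the end.
    """
    if len(members) < 2:
        raise ValueError("need at least two people to have a primary and a secondary")
    weeks_elapsed = (today - epoch).days // 7
    i = weeks_elapsed % len(members)
    return members[i], members[(i + 1) % len(members)]


def alias_lines(team, members, today):
    """Build sendmail-style alias lines ("name: address") for one team."""
    primary, secondary = on_call_pair(members, today)
    return [
        f"{team}-primary: {primary}",
        f"{team}-secondary: {secondary}",
    ]


if __name__ == "__main__":
    # Hypothetical team; Nagios contacts would point at the two aliases.
    for line in alias_lines("unix", ["alice", "bob"], date.today()):
        print(line)
```

Nagios contacts would then be defined once against the two aliases per team, so the rotation never touches the Nagios config itself.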



On Tue, Aug 26, 2014 at 8:28 AM, Alan Robertson <[email protected]> wrote:

> There's kind of a cool tool for connecting Nagios to PagerDuty called
> Flapjack - designed to avoid flooding people with pages when things go
> south.
>
> http://flapjack.io/
>
>
>
> On 08/25/2014 05:02 PM, Lawrence K. Chen, P.Eng. wrote:
>
>>
>> On 08/25/14 13:44, Warner wrote:
>>
>>> On Mon, Aug 25, 2014 at 06:09:10AM -0700, Nathan Clemons(
>>> [email protected]) wrote:
>>>
>>>> We're looking to set up small teams in nagios and rotate between
>>>> primary and secondary contacts, vs having one global on call person.
>>>> (I.e., two networking folks, two VMware folks, two Unix folks, etc.)
>>>> What kind of solutions have folks tried for this? Pagerduty seems
>>>> excessively priced for this kind of task, especially when we're trying
>>>> to trim opex costs. When I worked at /. we used sendmail aliases to
>>>> control the paging and just ran a script from cron to adjust the list
>>>> to the next person in line on Monday morning.
>>>>
>>> In the past, I've used qmail dot files and shell scripts. Standardized
>>> the contacts on e-mail aliases. That can work well.
>>>
>>> Since then, I've become a big fan of Pager Duty. Not having to maintain
>>> a separate schedule, having a central point for notifications, and
>>> additional bells and whistles such as notification when going on call
>>> are huge wins.
>>>
>>> Both approaches work well. Pager Duty does have value, though; I wouldn't
>>> write it off.
>>>
>>>
>>> Warner
>>>
>> I don't know much about pagerduty, except one group on campus that shares
>> our Nagios server is using it.
>>
>> So, the Perl script that ties it into Nagios hasn't left a good impression
>> on me.
>>
>> A couple of weeks after I had set it up, I noticed it had spawned 1000s of
>> copies of itself and our server was close to death....after clearing them,
>> it would just start building up again.  I thought about making a promise
>> to deal with it in the short term, though I couldn't recall if CFEngine
>> 2.2 had the capability or what its syntax might be.  Saw there were some
>> notifications queued, and that they were all hanging on that....seems the
>> first gets stuck on it, and the rest get stuck on the first process still
>> being there.
>>
>> In trying to see what it was doing...found it's trying to post to some
>> https URL through LWP.  Except it still seems that after 10+ years...LWP
>> https through a proxy is still busted, so don't know why this script would
>> be expected to work....
>>
>> And, a proxy is needed because the server is in private IP space
>> (eventually our entire datacenter will be....though sounds like it'll all
>> be behind our F5, but it's been WIP for almost a couple of years now.)
>>
>> In the meantime it's a largely neglected/forgotten Squid proxy server that
>> I threw up back in 2007 to replace the one that everybody depended on, but
>> nobody claimed ownership of when the last of some UltraServer 2's were
>> decommissioned.  It's running in a Solaris Zone, which has been moved and
>> undergone upgrade on attach a few times....
>>
>> After a couple of days, I opted for an earlier suggestion I had found
>> online.  I used a Perl module, LWP-Proxy-Connect (still waiting to see if
>> it'll get accepted into FreeBSD Ports), to make the script work.  Just a
>> one line change to the pagerduty script, IIRC, and it started working
>> again....
>>
>> That was until I let CFEngine loose again, and it reverted it :)
>>
>> While I was working on it, the group using it finally logged a couple of
>> tickets...one about unable-to-fork errors, and one saying that they had
>> stopped getting notifications, where they thought there should've been
>> some on the weekend.  (They were the ones killing my Nagios server.)
>>
>> Later, they added that it had worked up to the Friday before....
>>
>> Finally they admitted that they had changed it that Friday from using
>> email to posting to web for notifications.  (Had I known, I might have
>> just suggested they switch it back :)
>>
>> Hadn't really thought about our notifications from this Nagios server now
>> being dependent on our smtp server....our old server had been in the
>> datacenter range that is completely open to the world....so it did its own
>> mail delivery (especially important when it used to largely inform us of
>> problems with campus email...)  Though it's getting hard for me to handle
>> notifications timely/safely....
>>
>>
> _______________________________________________
> Discuss mailing list
> [email protected]
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
>
