On 30/08/11 15:09, Samuel Richardson wrote:
> I think it has relevance to Ruby, at least I know I like to monitor my
> delayed_jobs/redis/sphinx using something (usually Monit).
>
> I originally asked because I was curious how others were handling
> doing the same thing.
>
> You've mentioned a couple of times, "what happens if monit/init falls
> over" how do you manage Nagios falling over? Do you
> have redundant systems or just notice if an SMS hasn't come in for a
> while?
Good question. This could descend into a war over which system, monit or
nagios, will run longest without falling over. Neither, I'm sure:
everything breaks eventually for some reason.

But with our setup if nagios died, we would notice.

On the "monitored" machine, the nagios nrpe agent would return a failure
if it died, and we would get alerts. The same goes if someone pulls the
plug out of the machine.

And yes, we have multiple nagios instances (in different data centres)
connecting in to test the monitored machines.
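
A minimal sketch of the "who notices when the monitor dies" idea, in
Python. It assumes each nagios instance records a heartbeat timestamp
somewhere the others can read. The instance names, the ten-minute
threshold and the stale_monitors helper are all made up for illustration;
none of this is a nagios API or a description of any real setup.

```python
from datetime import datetime, timedelta


def stale_monitors(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the names of monitors whose last heartbeat is older than max_silence."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


# Hypothetical heartbeat records, one per monitoring instance.
now = datetime(2011, 8, 30, 15, 0)
last_seen = {
    "nagios-dc1": datetime(2011, 8, 30, 14, 58),  # reported 2 minutes ago
    "nagios-dc2": datetime(2011, 8, 30, 14, 30),  # silent for 30 minutes
}

print(stale_monitors(last_seen, now))
```

With instances in different data centres each running a check like this
against the others, a single monitor going quiet is noticed by the ones
that are still up.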

Look, monit and nagios do different things, with a slight overlap in
some cases; neither can do everything the other can. There are also
tools like cacti, which is a "monitoring" solution too, but for trend
analysis. Same verb in English, but completely different roles.

When I think "monitoring and nagios", I'm really thinking of something
that will page me when the ethernet cord is proverbially kicked out
while someone is moving a rack ... We want to be made aware so that we
can take some action. We try to monitor as many of our services and
applications as we can in this simplistic "Is it working?" fashion,
which is not always easy or practical.
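
As a rough illustration of an "Is it working?" check, here is a sketch in
Python that only asks whether a TCP port accepts connections. The
check_tcp name and its defaults are assumptions for this example. The
only nagios convention relied on is the plugin exit code scheme, where 0
means OK and 2 means CRITICAL.

```python
import socket

# Nagios plugin exit codes: 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=3.0):
    """Return (status_code, message) depending on whether host:port accepts a connection."""
    try:
        # create_connection raises an OSError subclass on refusal, timeout
        # or name resolution failure.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

A wrapper script would print the message and exit with the status code.
A check like this says nothing about whether the service behind the port
is healthy, which is part of why the simplistic approach is not always
practical.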

This is a topic I've given some thought to and been involved in from a
management point of view. To repeat myself, the "enterprise" monitoring
package is far more about staff management and documentation than
technology. I don't care how the SMS that tells me there is an issue
gets generated, as long as whoever receives it is actually able to
resolve the issue, which requires knowledge and competence.

We always put our alerts to the "3am Sysadmin test", i.e. would a
sysadmin who has logged in to this box once before in their life be able
to make sense of what to do based on the wiki documentation and the
information in the alert? If not, we don't include the monitoring (or we
may alert only during business hours).
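
The "3am Sysadmin test" can be sketched as a routing rule. This is only
an illustration in Python: the Check fields, the example URL and the
9-to-5 weekday window are assumptions, not a description of a real
configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Check:
    name: str
    runbook_url: str = ""           # wiki page explaining what to do (hypothetical field)
    actionable_alert: bool = False  # does the alert text itself say what is wrong?


def passes_3am_test(check):
    """A check passes only if it has documentation and a self-explanatory alert."""
    return bool(check.runbook_url) and check.actionable_alert


def should_page(check, when, business_hours=(9, 17)):
    """Page at any time for checks that pass; otherwise only on weekdays in business hours."""
    if passes_3am_test(check):
        return True
    start, end = business_hours
    return when.weekday() < 5 and start <= when.hour < end


documented = Check("postgres", "https://wiki.example.invalid/postgres", True)
undocumented = Check("legacy-batch")
three_am = datetime(2011, 8, 30, 3, 0)  # a Tuesday

print(should_page(documented, three_am))
print(should_page(undocumented, three_am))
```

The stricter option mentioned above, not including the monitoring at
all, corresponds to simply not defining the check.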


>
> Samuel Richardson
> www.richardson.co.nz | 0405 472 748
>
>
> On Tue, Aug 30, 2011 at 3:03 PM, Andrew Boag <[email protected]> wrote:
>
>     Ok, more banter ... we should probably go off list if this goes on
>     any more (it's not really ruby chatter any more :-)
>
>
>     On 30/08/11 14:40, Dmytrii Nagirniak wrote:
>>
>>         We haven't used monit, but it looks to me like it's more of a
>>         solution that will try to "solve" the failure i.e. restart
>>         apache/postgres if the system goes away.
>>
>>
>>     I think the main purpose of monit is to monitor and take an
>>     action. The action can be "solving" the issues as well as sending
>>     notifications and waking up stuff.
>
>     Sure. Like I said, I'm not an expert on monit, and if you can get
>     it working, awesome.
>
>     Nagios has a relatively shiny GUI which lets you selectively
>     schedule downtime for checks and so on.
>
>
>>
>>      
>>
>>         This is fine but what if _this_ (i.e. the monit script)
>>         process fails?
>>
>>
>>     It looks like the guys already thought about that, so it should be
>>     handled pretty well.
>>     http://mmonit.com/wiki/Monit/FAQ#init
>
>     Once again, looks good. But what if it jams after having been init-ed?
>
>     Still, there ain't no perfect solution to this. We just happen to
>     have a lot of experience (from trial and lots of error) with our
>     approach.
>
>
>>
>>
>>     -- 
>>     You received this message because you are subscribed to the
>>     Google Groups "Ruby or Rails Oceania" group.
>>     To post to this group, send email to
>>     [email protected].
>>     To unsubscribe from this group, send email to
>>     [email protected].
>>     For more options, visit this group at
>>     http://groups.google.com/group/rails-oceania?hl=en.
>
>
>     -- 
>
>     ----
>
>     Andrew Boag - Director
>     Catalyst IT
>     [email protected]
>
>     mob: +61 421 528 125
>     ddi: +61 2 8002 1758
>
>     www.catalyst-au.net
>
>
>


-- 

----

Andrew Boag - Director
Catalyst IT
[email protected]

mob: +61 421 528 125
ddi: +61 2 8002 1758

www.catalyst-au.net
