Nagios is rather limited, especially when it comes to trend analysis and
bigger-picture stuff. No man is an island, but as far as Nagios is
concerned nearly every item is. It's focussed on noting what's happening
in the here and now with a particular thing, which is great for alerting
you to problems, but not so good for retrospective or broad-spectrum
analysis. When something is alerting, can you quickly and easily see
trend data for related metrics? E.g. a web site is being slow: can you
see Apache requests per second, disk IO, database queries per second,
network latency etc., not just at that particular moment in time but in
the run-up to it? What is your root cause?
Most graphing plugins for Nagios rely on RRDTool, which is great but
also has problems with precision. To get trend analysis you need to
integrate tools like Smokeping or Cacti, neither of which was designed
to operate with Nagios, so you're glueing two disparate systems
together. Nagios doesn't scale that well (nor does Cacti, from personal
experience) and you can often find yourself needing to spread the
workload across a number of boxes (particularly when you're ISP
scale+). Unfortunately there is no integrated mechanism for doing so
(Zabbix provides easily configurable proxy servers, but falls short in
other areas). If you're running config management you could push that
into the logic for the config management tool, or maybe use tools like
DMX to help.
Other tools even take a step away from config files and look to
auto-detection and configuration.
Here's a real-world example of monitoring, based on something Shopzilla
does in house. They're a site which sees over 19m uniques a month.
They expose a number of mbeans (metrics) over JMX from their production
and test environments, for example how long it took to process a page
request, right down to how long the foo-bar function took to run.
Alongside that is the big-picture stuff like stack heap sizes, garbage
collection runs, nursery size etc., and alongside that, from outside the
Java instances, they're looking at server resources, revenue figures,
all the business logic stuff. All this data is collected in real time
from their applications. By piecing it all together they can see that
every additional second taken for certain actions equates to $x loss in
revenue (they make money based on click-throughs).
Nagios will tell me whether I hit a threshold or not. It won't tell me
that slightly higher load on this, that and the other metric is going
to result in us missing revenue targets for the hour. It doesn't see
the bigger picture, or getting it to involves a fair bit of work and
manual configuration. There is no way to say "Let me know if CPU
usage of this application is x% higher than usual for this day of the
week on this week of the month, compared to the last 6 months".
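To make that kind of check concrete, here is a minimal sketch in Python.
It is only an illustration of the idea, not something Nagios or
Shopzilla's system provides: the function names, the (timestamp, value)
sample format and the "30 days per month" approximation are all
assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def week_of_month(ts):
    """Return 1-5: which seven-day block of the month the date falls in."""
    return (ts.day - 1) // 7 + 1


def seasonal_baseline(samples, now, months=6):
    """Mean of past samples sharing now's weekday and week of the month.

    samples is an iterable of (datetime, float) pairs, for example daily
    CPU averages (hypothetical format). A month is approximated as 30 days.
    Returns None when no comparable samples exist.
    """
    cutoff = now - timedelta(days=30 * months)
    matching = [
        value
        for ts, value in samples
        if cutoff <= ts < now
        and ts.weekday() == now.weekday()
        and week_of_month(ts) == week_of_month(now)
    ]
    return mean(matching) if matching else None


def is_unusual(samples, now, current, threshold_pct, months=6):
    """True if current exceeds the seasonal baseline by more than threshold_pct."""
    baseline = seasonal_baseline(samples, now, months)
    if baseline is None or baseline == 0:
        return False  # nothing sensible to compare against
    return (current - baseline) / baseline * 100 > threshold_pct
```

Fed with stored history, is_unusual(samples, now, current, 20) answers
"is this more than 20% above what is normal for this kind of day?".
A real system would also want to account for variance, not just the
mean, but the shape of the question is the same.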
In combination with tools like Graphite for graphing, Shopzilla are able
to produce useful and extremely accurate graphs that let them see at a
glance that they've got a problem, along with the probable cause or, as
is more likely, causes. They're even presenting information in such
readable formats that their non-technical staff are able to see and
understand what it means to their aspect of the business.
In my experience Nagios is incapable of scaling up to that level, and
can't do even a fraction of what that allows (I'm not sure I've ever
used a monitoring system that is), but it's a very real world need for
monitoring. The more we know, the better we can track down problems
(provided it's presented sensibly... too much data can confuse an
issue). If Shopzilla's in-house solution can handle that scale of
monitoring, it should scale down very nicely too. For all the fancy
stuff around it, it all boils down to gathering information, storing,
presenting and alerting, exactly the same as any other monitoring
solution. How much duplication of effort has there been in building
similar systems in other environments?
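As a rough illustration of how small that core is, the four stages
(gathering, storing, presenting, alerting) can be sketched in a few
lines of Python. This is a toy under stated assumptions, not a
description of Shopzilla's system or of any existing tool: the Monitor
class, its method names and the in-memory store are all made up for the
example.

```python
import time
from collections import defaultdict


class Monitor:
    """Toy pipeline: gather -> store -> present -> alert (hypothetical API)."""

    def __init__(self):
        self.collectors = {}            # metric name -> callable returning a number
        self.store = defaultdict(list)  # metric name -> [(timestamp, value), ...]
        self.thresholds = {}            # metric name -> maximum acceptable value

    def add_metric(self, name, collector, threshold=None):
        """Register a collector and, optionally, an alert threshold."""
        self.collectors[name] = collector
        if threshold is not None:
            self.thresholds[name] = threshold

    def gather(self, now=None):
        """Call every collector once and store the results."""
        now = time.time() if now is None else now
        for name, collector in self.collectors.items():
            self.store[name].append((now, collector()))

    def present(self, name):
        """Return a one-line plain-text summary of a metric's history."""
        values = [value for _, value in self.store[name]]
        if not values:
            return f"{name}: no data"
        return (f"{name}: last={values[-1]} min={min(values)} "
                f"max={max(values)} n={len(values)}")

    def alerts(self):
        """Return names of metrics whose latest value exceeds their threshold."""
        return [
            name
            for name, limit in self.thresholds.items()
            if self.store[name] and self.store[name][-1][1] > limit
        ]
```

Everything that makes a real system hard (scale, history, correlation
across metrics, sensible presentation) lives outside this skeleton,
which is rather the point: the skeleton itself keeps getting rebuilt.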
We can do better, and companies are proving it with their own in-house
systems. Now is a good chance to bring together what has been learnt
there and elsewhere and see what we can make.
Paul
On 07/22/2011 10:33 AM, Joseph Kern wrote:
Funny ... I am just sitting here configuring Nagios, and marveling at
how much power there is in an object oriented template system and
wondering why it isn't used more ...
Adam's xkcd comic had me laughing when it was first posted, now it has
me cringing.
Tom's mention of the four ponies of the monitoring apocalypse is a
great starting point.
So ... what is going to be different than Zenoss, MRTG, Nagios,
MS-SCOM, HP Openview, etc.? I've used them all ... and the only one I
complained about was MS-SCOM (although it DID have a few nice features).
The monitoring market has high table stakes. What are you going to do
that can't be implemented by a large organization that already has a
monitoring product?
On Fri, Jul 22, 2011 at 3:58 PM, Paul Graydon <[email protected]> wrote:
On 07/22/2011 09:16 AM, Robert Hajime Lanning wrote:
On 07/22/11 09:44, Paul Graydon wrote:
On 7/22/2011 2:29 AM, Adam Moskowitz wrote:
Paul Graydon wrote:
Hopefully with a good wide spread of interest and talents we could
finally get a monitoring tool that doesn't actually suck!
And what color pony do you want with that?
Seriously, given the incredibly wide range of applications, situations,
SLAs, services, constraints, conditions, and requirements, I think the
idea that a single tool will solve everyone's problems is, well, nothing
short of ludicrous.
By making /everything/ modular and extensible, and having the monitoring
platform be a framework which individual components are natively plugged
in to, everything from data collection to presentation, reporting or
responding. That's what the proposal seems to boil down to. It's
something we're sadly lacking with most monitoring solutions that I've
ever seen. It's almost entirely 'their way or the highway', with a few
bolt-ons on the side, fudged into place just to get by (with all the
unreliability and risk that implies).
Then you end up with HP OpenView...
ugh
So help them make it not HP OpenView. Point out the mistakes made
with that platform, what it's good at and what it's bad at.
They're at the very initial design stages, not implementation and
so now is the time to help ensure what they produce goes the right
way.
It's rare to get a chance to influence a product in these stages,
usually by the time people start really talking the initial
implementation is already done (along with what may be bad design
decisions). Most of these solutions come out of something coded
to meet a business's specific needs, not a bunch of people across
a number of different businesses and environments collaborating.
What we've got here are a bunch of dedicated and talented
programmers and operations people motivated to solve a real
problem, and not only willing but enthusiastic about spending
their spare time on it. We'd be utter fools not to capitalise on
that. We can either sit here and moan about how bad an idea this
is and 3 years down the line be proven correct as yet another
product fails to meet the real operations needs, or participate
and help to make something that makes a serious attempt to fix a
very real and significant problem, and maybe, just maybe, 3 years
down the line find you've got something of use.
Paul
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/