Nagios is rather limited, especially when it comes to trend analysis and the bigger picture. No man is an island, but as far as Nagios is concerned nearly every item is. It's focussed on noting what's happening in the here and now with a particular thing, which is great for alerting you to problems, but not so good for retrospective or broad-spectrum analysis. When something is alerting, can you quickly and easily see trend data for related metrics? E.g. a web site is being slow: can you see Apache requests per second, disk IO, database queries per second, network latency etc., not just at that particular moment in time but in the run-up to it? What is your root cause?

Most graphing plugins for Nagios rely on RRDTool, which is great but also has problems with precision. To get trend analysis you need to integrate tools like Smokeping or Cacti, neither of which was designed to operate with Nagios, so you're glueing two disparate systems together.

Nagios doesn't scale that well (nor does Cacti, from personal experience) and you can often find yourself needing to spread the workload amongst a number of boxes (particularly when you're at ISP scale and above). Unfortunately there is no integrated mechanism for doing so (Zabbix provides easily configurable proxy servers, but falls short in other areas). If you're running config management you could push that into the logic for the config management tool, or maybe use tools like DMX to help. Other tools even take a step away from config files and look to auto-detection and configuration.

Here's a real-world example of monitoring, based on something Shopzilla does in house. They're a site which sees over 19m uniques a month. They expose a number of MBeans (metrics) over JMX from their production and test environments, for example how long it took to process a page request, right down to how long the foo-bar function took to run. Alongside that is the big-picture stuff like stack and heap sizes, garbage collection runs, nursery size etc., and from outside the Java instances they're looking at server resources, revenue figures, all the business logic stuff. All this data is collected in real time from their applications. By piecing it all together they can see that every additional second taken for certain actions equates to $x loss in revenue (they make money based on click-throughs).
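
To make the "every additional second equates to $x" idea concrete, here is a minimal sketch of one way such a figure could be estimated: a plain least-squares slope of revenue against response time. This is not Shopzilla's method (I don't know how they calculate it); the function name and the idea of pairing one latency sample with one revenue sample per interval are my own assumptions for illustration.

```python
def revenue_per_second(latencies, revenues):
    """Least-squares slope of revenue against latency.

    latencies: seconds taken for some action, one value per interval
    revenues:  revenue observed in the same intervals
    Returns the estimated change in revenue per extra second of latency
    (negative means slower responses cost money).
    Hypothetical helper for illustration only.
    """
    if len(latencies) != len(revenues) or len(latencies) < 2:
        raise ValueError("need two or more paired samples")
    n = len(latencies)
    mean_x = sum(latencies) / n
    mean_y = sum(revenues) / n
    # Spread of the latency samples around their mean.
    sxx = sum((x - mean_x) ** 2 for x in latencies)
    if sxx == 0:
        raise ValueError("latency never varies, slope is undefined")
    # How latency and revenue move together.
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(latencies, revenues))
    return sxy / sxx
```

A slope on its own says nothing about cause, of course; it only becomes useful once it sits next to the other metrics collected over the same period.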

Nagios will tell me whether I hit a threshold or not; it won't tell me that slightly higher load on this, that and the other metric is going to result in us missing revenue targets for the hour. It doesn't see the bigger picture, or it takes a fair bit of work and manual configuration to make it do so. There is no way to say "Let me know if CPU usage of this application is x% higher than usual for this day of the week on this week of the month, compared to the last 6 months".
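
As a rough sketch of what that kind of rule might look like, assuming the history is already available as (timestamp, value) pairs: bucket past samples by weekday and week of the month, average the matching bucket over roughly the last six months, and compare the current reading against it. All of the names here are hypothetical and the 183-day window is just my approximation of "6 months".

```python
from datetime import datetime, timedelta
from statistics import mean


def bucket(ts):
    """(weekday, week-of-month) key, both zero-based: Monday is 0,
    and days 1-7 of a month are week 0."""
    return (ts.weekday(), (ts.day - 1) // 7)


def baseline(history, now, lookback_days=183):
    """Mean of past values that fall in the same bucket as `now`
    within the lookback window, or None if there are none."""
    cutoff = now - timedelta(days=lookback_days)
    key = bucket(now)
    matching = [value for ts, value in history
                if cutoff <= ts < now and bucket(ts) == key]
    return mean(matching) if matching else None


def higher_than_usual(history, now, current, pct_over):
    """True if `current` exceeds the bucket baseline by more than pct_over %."""
    usual = baseline(history, now, lookback_days=183)
    if usual is None:
        return False  # no comparable history, so nothing to compare against
    return current > usual * (1 + pct_over / 100.0)
```

Something like `higher_than_usual(cpu_history, datetime.now(), current_cpu, 20)` would then be the check. A real system would want more than a plain mean (medians, some notion of variance), but the point is that the rule needs history, which a pure threshold check never looks at.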

In combination with tools like Graphite for graphing, Shopzilla are able to produce useful and extremely accurate graphs that allow them to see at a glance that they've got a problem and its probable cause or, as is more likely, causes. They're even presenting information in such readable formats that their non-technical staff are able to see and understand what it means for their part of the business.

In my experience Nagios is incapable of scaling up to that level, and can't do even a fraction of what that allows (I'm not sure I've ever used a monitoring system that can), but it's a very real-world need for monitoring. The more we know, the better we can track down problems (provided it's presented sensibly; too much data can confuse an issue). If Shopzilla's in-house solution can handle that scale of monitoring, it should scale down very nicely too. For all the fancy stuff around it, it all boils down to gathering information, storing, presenting and alerting, exactly the same as any other monitoring solution. How much duplication of effort has there been in building similar systems in other environments?

We can do better, and companies are proving it with their own in-house systems. Now is a good chance to bring what has been learnt there and elsewhere and see what we can make.

Paul


On 07/22/2011 10:33 AM, Joseph Kern wrote:
Funny ... I am just sitting here configuring Nagios, and marveling at how much power there is in an object oriented template system and wondering why it isn't used more ...

Adam's xkcd comic had me laughing when it was first posted; now it has me cringing. Tom's mention of the four ponies of the monitoring apocalypse is a great starting point.

So ... what is going to be different than Zenoss, MRTG, Nagios, MS-SCOM, HP Openview, etc.? I've used them all ... and the only one I complained about was MS-SCOM (although it DID have a few nice features).

The monitoring market has high table stakes. What are you going to do that can't be implemented by a large organization that already has a monitoring product?


On Fri, Jul 22, 2011 at 3:58 PM, Paul Graydon wrote:

    On 07/22/2011 09:16 AM, Robert Hajime Lanning wrote:

        On 07/22/11 09:44, Paul Graydon wrote:

            On 7/22/2011 2:29 AM, Adam Moskowitz wrote:

                Paul Graydon wrote:

                    Hopefully with a good wide spread of interest and
                    talents we could finally get a monitoring tool that
                    doesn't actually suck!

                And what color pony do you want with that?

                Seriously, given the incredibly wide range of applications,
                situations, SLAs, services, constraints, conditions, and
                requirements, I think the idea that a single tool will solve
                everyone's problems is, well, nothing short of ludicrous.

            By making /everything/ modular and extensible, and having the
            monitoring platform be a framework which individual components
            are natively plugged in to, everything from data collection to
            presentation, reporting or responding.  That's what the proposal
            seems to boil down to.  It's something we're sadly lacking with
            most monitoring solutions that I've ever seen.  It's almost
            entirely 'their way or the highway', with a few bolt-ons on the
            side, fudged into place just to get by (with all the
            unreliability and risk that implies)

        Then you end up with HP OpenView...
        ugh

    So help them make it not HP OpenView.  Point out the mistakes made
    with that platform, what it's good at and what it's bad at.
     They're at the very initial design stages, not implementation and
    so now is the time to help ensure what they produce goes the right
    way.

    It's rare to get a chance to influence a product in these stages,
    usually by the time people start really talking the initial
    implementation is already done (along with what may be bad design
    decisions.)  Most of these solutions come out of something coded
    to meet a business's specific needs, not a bunch of people across
    a number of different businesses and environments collaborating.

    What we've got here are a bunch of dedicated and talented
    programmers and operations people motivated to solve a real
    problem, and not only willing but enthusiastic about spending
    their spare time on it.  We'd be utter fools not to capitalise on
    that.  We can either sit here and moan about how bad an idea this
    is and 3 years down the line be proven correct as yet another
    product fails to meet the real operations needs, or participate
    and help to make something that makes a serious attempt to fix a
    very real and significant problem, and maybe, just maybe, 3 years
    down the line find you've got something of use.

    Paul

    _______________________________________________
    Discuss mailing list
    [email protected] <mailto:[email protected]>
    https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
    This list provided by the League of Professional System Administrators
    http://lopsa.org/


