Re: [lopsa-discuss] Monitoring Sucks!

Paul Graydon Fri, 22 Jul 2011 15:18:01 -0700

Nagios is rather limited, especially when it comes to trend analysis and bigger picture stuff. No man is an island, but as far as Nagios is concerned, mostly every item is. It's focussed on noting what's happening in the here and now with a particular thing, which is great for alerting you to problems, but not so good for retrospective or broad-spectrum analysis. Something is alerting, can you relatively quickly and easily see trend data for related metrics? e.g. a web site is being slow, can you see apache requests per second, disk IO, database queries per second, network latency etc, not just at that particular moment in time but in the run up to it? What is your root cause?

Most graphing plugins for Nagios rely on RRDTool which is great, but also has problems with precision. To get trend analysis you need to integrate tools like smokeping or cacti, neither of which was designed to operate with Nagios, so you're glueing two disparate systems together. Nagios doesn't scale that well (nor does cacti from personal experience) and you can often find yourself needing to spread out the workload amongst a number of boxes (particularly when you're ISP scale+). Unfortunately there is no integrated mechanism for doing so (Zabbix provides easily configurable proxy servers, but falls short in other areas.) If you're running config management you could push that into the logic for the config management tool, or maybe use tools like DMX to help.

Other tools even take a step away from config files and look to auto-detection and configuration.

Here's a real world example of monitoring, based on something Shopzilla does in house. They're a site which sees over 19m uniques a month. They express a number of mbeans (metrics) over JMX from their production and test environment, for example how long it took to process a page request, and down to how long foo-bar function took to run. Alongside that are the big picture stuff like stack heap sizes, garbage collection runs, nursery size etc, and alongside that from outside of the java instances they're looking at server resources, revenue figures, all the business logic stuff. All this data is collected in real-time from their applications. By piecing it all together they can see that every additional second taken for certain actions equates to $x loss in revenue (they make money based on click-throughs).
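
To make the "every additional second equates to $x" idea concrete, here is a minimal sketch of one way such a figure could be estimated: an ordinary least-squares slope of revenue against response time. This is not Shopzilla's method (I don't know how they compute it), and the function name and the sample numbers are invented purely for illustration.

```python
def revenue_per_second(latencies, revenues):
    """Least-squares slope: change in revenue per extra second of latency.

    latencies and revenues are paired samples (e.g. one pair per hour).
    A negative result means revenue drops as latency rises.
    """
    if len(latencies) != len(revenues) or len(latencies) < 2:
        raise ValueError("need at least two paired samples")
    n = len(latencies)
    mean_x = sum(latencies) / n
    mean_y = sum(revenues) / n
    # Sum of squared deviations of latency, and of the cross products.
    sxx = sum((x - mean_x) ** 2 for x in latencies)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(latencies, revenues))
    if sxx == 0:
        raise ValueError("latency never varies, slope is undefined")
    return sxy / sxx


# Invented sample data: seconds per page request vs. revenue for that hour.
latency_s = [1.0, 2.0, 3.0, 4.0]
revenue_usd = [1000.0, 900.0, 800.0, 700.0]
print(revenue_per_second(latency_s, revenue_usd))
```

A real analysis would need far more samples and would have to account for time of day, traffic volume and so on, but the point stands: once the metrics and the revenue figures live in the same place, this kind of question becomes a few lines of code.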

Nagios will tell me if I hit a threshold or not, but it won't tell me that just slightly higher load on this, that and the other metrics is going to result in us missing revenue targets for the hour. It doesn't see the bigger picture, or else it takes a fair bit of work and manual configuration to get it to. There is no way to say "Let me know if cpu usage of this application is x% higher than usual for this day of the week on this week of the month, compared to the last 6 months".
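
As a rough sketch of what that kind of rule could look like if the history were available to query: the code below compares a current reading against the average of past readings taken on the same weekday in the same week of the month. All the names are hypothetical, the "last 6 months" window is approximated as 186 days, and the history is assumed to be one reading per day keyed by date.

```python
from datetime import date
from statistics import mean


def week_of_month(d: date) -> int:
    """Days 1-7 are week 1, days 8-14 are week 2, and so on."""
    return (d.day - 1) // 7 + 1


def baseline(history: dict, today: date, window_days: int = 186):
    """Mean of past readings from the same weekday and week of the month.

    history maps date -> reading. Returns None if no comparable day exists.
    """
    samples = [
        value
        for day, value in history.items()
        if day < today
        and (today - day).days <= window_days
        and day.weekday() == today.weekday()
        and week_of_month(day) == week_of_month(today)
    ]
    return mean(samples) if samples else None


def is_unusual(history: dict, today: date, current: float, pct: float) -> bool:
    """True if current is more than pct percent above the baseline."""
    base = baseline(history, today)
    if base is None or base == 0:
        return False  # nothing sensible to compare against
    return (current - base) / base * 100 > pct


# Invented CPU usage readings (percent) for previous fourth Fridays,
# plus one unrelated Thursday that must be ignored.
cpu_history = {
    date(2011, 4, 22): 38.0,
    date(2011, 5, 27): 42.0,
    date(2011, 6, 24): 40.0,
    date(2011, 6, 23): 90.0,
}
print(is_unusual(cpu_history, date(2011, 7, 22), current=50.0, pct=20))
```

The comparison itself is trivial. The hard part, and the part most tools don't give you, is keeping six months of data at a resolution where it is still meaningful.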

In combination with tools like Graphite for graphing, Shopzilla are able to produce useful and extremely accurate graphs that allow them to see at a glance that they've got a problem, and the probable cause or, as is more likely, causes. They're even presenting information in such readable formats that their non-technical staff are able to see and understand what it means to their aspect of the business.

In my experience Nagios is incapable of scaling up to that level, and can't do even a fraction of what that allows (I'm not sure I've ever used a monitoring system that can), but it's a very real world need for monitoring. The more we know, the better we can track down problems (provided it's presented sensibly.. too much data can confuse an issue). If Shopzilla's in-house solution can handle that scale of monitoring, it should scale down very nicely too. For all the fancy stuff around it, it all boils down to gathering information, storing, presenting and alerting, exactly the same as any other monitoring solution. How much duplication of effort has there been in building similar systems in other environments?
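
To illustrate the "gathering, storing, presenting and alerting" breakdown, here is a deliberately tiny sketch with each stage as a separate, replaceable piece. It is not a description of any existing tool; the class and method names are made up, and a real system would swap the in-memory dict for a proper time-series store.

```python
import time
from collections import defaultdict


class Monitor:
    """Toy pipeline: gather -> store -> alert, with a text 'presentation'."""

    def __init__(self):
        self.collectors = {}            # metric name -> function returning a number
        self.store = defaultdict(list)  # metric name -> [(timestamp, value), ...]
        self.rules = []                 # (metric name, predicate, message)

    def add_collector(self, name, fn):
        self.collectors[name] = fn

    def add_rule(self, metric, predicate, message):
        self.rules.append((metric, predicate, message))

    def poll(self, now=None):
        """Gather one sample per collector, store it, return triggered alerts."""
        now = time.time() if now is None else now
        for name, fn in self.collectors.items():
            self.store[name].append((now, fn()))
        alerts = []
        for metric, predicate, message in self.rules:
            series = self.store.get(metric)
            if series and predicate(series[-1][1]):
                alerts.append(f"{metric}: {message} (value={series[-1][1]})")
        return alerts

    def present(self, metric):
        """Render the stored history of one metric as a line of text."""
        return ", ".join(f"{value:g}" for _, value in self.store[metric])


# Usage with a fake collector that returns canned load averages.
readings = iter([0.5, 2.0])
monitor = Monitor()
monitor.add_collector("load", lambda: next(readings))
monitor.add_rule("load", lambda v: v > 1.0, "high load")
print(monitor.poll(now=1))
print(monitor.poll(now=2))
print(monitor.present("load"))
```

Because alerting reads from the store rather than from the collector directly, the same stored history is available for trending and for rules that look further back than the latest sample.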

We can do better, and companies are proving it with their own in-house systems. Now is a good chance to bring what has been learnt there and elsewhere and see what we can make.


Paul


On 07/22/2011 10:33 AM, Joseph Kern wrote:

Funny ... I am just sitting here configuring Nagios, and marveling at how much power there is in an object oriented template system and wondering why it isn't used more ...

Adam's xkcd comic had me laughing when it was first posted, now it has me cringing.

Tom's mention of the four ponies of the monitoring apocalypse is a great starting point.

So ... what is going to be different than Zenoss, MRTG, Nagios, MS-SCOM, HP Openview, etc.? I've used them all ... and the only one I complained about was MS-SCOM (although it DID have a few nice features).

The monitoring market has high table stakes. What are you going to do that can't be implemented by a large organization that already has a monitoring product?

On Fri, Jul 22, 2011 at 3:58 PM, Paul Graydon wrote:


    On 07/22/2011 09:16 AM, Robert Hajime Lanning wrote:

        On 07/22/11 09:44, Paul Graydon wrote:

            On 7/22/2011 2:29 AM, Adam Moskowitz wrote:

                Paul Graydon wrote:

                    Hopefully with a good wide spread of interest and
                    talents we could
                    finally get a monitoring tool that doesn't
                    actually suck!

                And what color pony do you want with that?

                Seriously, given the incredibly wide range of
                applications, situations,
                SLAs, services, constraints, conditions, and
                requirements, I think the
                idea that a single tool will solve everyone's problems
                is, well, nothing
                short of ludicrous.

            By making /everything/ modular and extensible, and having
            the monitoring
            platform be a framework which individual components are
            natively plugged
            in to, everything from data collection, to presentation,
            reporting or
            responding. That's what the proposal seems to boil down
            to.  It's
            something we're sadly lacking with most monitoring
            solutions that I've
            ever seen.  It's almost entirely 'their way or the high
            way', with a few
            bolt-ons on the side, fudged into place just to get by
            (with all the
            unreliability and risk that implies)

        Then you end up with HP OpenView...
        ugh

    So help them make it not HP OpenView.  Point out the mistakes made
    with that platform, what it's good at and what it's bad at.
     They're at the very initial design stages, not implementation and
    so now is the time to help ensure what they produce goes the right
    way.

    It's rare to get a chance to influence a product in these stages,
    usually by the time people start really talking the initial
    implementation is already done (along with what may be bad design
    decisions.)  Most of these solutions come out of something coded
    to meet a business's specific needs, not a bunch of people across
    a number of different businesses and environments collaborating.

    What we've got here are a bunch of dedicated and talented
    programmers and operations people motivated to solve a real
    problem, and not only willing but enthusiastic about spending
    their spare time on it.  We'd be utter fools not to capitalise on
    that.  We can either sit here and moan about how bad an idea this
    is and 3 years down the line be proven correct as yet another
    product fails to meet the real operations needs, or participate
    and help to make something that makes a serious attempt to fix a
    very real and significant problem, and maybe, just maybe, 3 years
    down the line find you've got something of use.

    Paul

    _______________________________________________
    Discuss mailing list
    [email protected] <mailto:[email protected]>
    https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
    This list provided by the League of Professional System Administrators
    http://lopsa.org/

