That is certainly food for thought, Paul.

But how will you abstract application-based monitoring and reporting? From
your example it seems that the value is derived from knowledge of the
application methods and development stack (including Java runtime
idiosyncrasies).

How will I benefit if I am running a Python or Ruby application?
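For what it's worth, the collection side needn't be tied to any one runtime: anything that can open a socket can feed the same pipeline. A minimal sketch in Python, using Graphite's plaintext protocol as an example; the host, port, and metric path here are placeholders:

```python
import socket
import time

def format_metric(path, value, timestamp):
    # Graphite's plaintext protocol is one line per datapoint:
    # "<metric.path> <value> <unix-timestamp>\n"
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.example.com", port=2003):
    # Ship a single datapoint to a carbon listener (placeholder host).
    line = format_metric(path, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=2)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```

A Ruby version would be a dozen lines of much the same shape; the wire format, not the language, is the contract.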
On Jul 22, 2011 6:17 PM, "Paul Graydon" <[email protected]> wrote:
>
> Nagios is rather limited, especially when it comes to trend analysis and
bigger-picture stuff.  No man is an island, but mostly, as far as Nagios is
concerned, every item is.  It's focused on noting what's happening in the
here and now with a particular thing, which is great for alerting you to
problems, but not so good for retrospective or broad-spectrum analysis.
 Something is alerting: can you relatively quickly and easily see trend data
for related metrics?  E.g. a web site is being slow; can you see Apache
requests per second, disk IO, database queries per second, network latency,
etc., not just at that particular moment in time but in the run-up to it?
 What is your root cause?
> Most graphing plugins for Nagios rely on RRDTool, which is great but also
has problems with precision.  To get trend analysis you need to integrate
tools like Smokeping or Cacti, neither of which was designed to operate
with Nagios, so you're gluing two disparate systems together.  Nagios
doesn't scale that well (nor does Cacti, from personal experience) and you
can often find yourself needing to spread the workload amongst a number
of boxes (particularly when you're at ISP scale or beyond).  Unfortunately
there is no integrated mechanism for doing so (Zabbix provides easily
configurable proxy servers, but falls short in other areas).  If you're
running config management you could push that into the logic for the config
management tool, or maybe use tools like DMX to help.
> Other tools even take a step away from config files and look to
auto-detection and configuration.
>
> Here's a real-world example of monitoring, based on something Shopzilla
does in house.  They're a site which sees over 19m uniques a month.  They
expose a number of MBeans (metrics) over JMX from their production and test
environments: for example, how long it took to process a page request, right
down to how long the foo-bar function took to run.  Alongside that is the
big-picture stuff like stack and heap sizes, garbage collection runs,
nursery size, etc., and alongside that, from outside the Java instances,
they're looking at server resources, revenue figures, all the business-logic
stuff.  All this data is collected in real time from their applications.  By
piecing it all together they can see that every additional second taken for
certain actions equates to $x lost in revenue (they make money on
click-throughs).
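The per-function timing half of that isn't Java-specific either; a hypothetical sketch in Python of the same idea, with a plain dict standing in for whatever collector actually receives the numbers:

```python
import functools
import time

timings = {}  # stands in for a real metrics collector

def timed(name):
    """Record the duration of every call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                timings.setdefault(name, []).append(time.time() - start)
        return wrapper
    return decorator

@timed("pages.search.render")
def render_search_page():
    time.sleep(0.01)  # pretend to do some work
    return "<html>...</html>"

render_search_page()
```

The collected durations can then be shipped off to whatever graphing or alerting backend is in use.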
>
> Nagios will tell me if I hit a threshold or not; it won't tell me that
slightly higher load on this, that, and the other metric is going to
result in us missing revenue targets for the hour.  It doesn't see the
bigger picture, or requires a fair bit of work and manual configuration to
do so.  There is no way to say "Let me know if CPU usage of this
application is x% higher than usual for this day of the week on this week of
the month, compared to the last 6 months".
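That "x% higher than usual" check is easy enough to state in code once the history is to hand; a hypothetical sketch, where `history` is a list of readings from the same hour-and-weekday slot over previous weeks:

```python
def above_baseline(current, history, pct_threshold):
    # True if `current` exceeds the historical mean by more than
    # `pct_threshold` percent; `history` holds readings from the
    # same hour/weekday slot over previous weeks.
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / float(len(history))
    return current > baseline * (1 + pct_threshold / 100.0)

# e.g. this Friday's CPU usage vs. the last six Fridays:
# above_baseline(80, [50, 52, 48, 51, 49, 50], 25)  ->  True
```

The hard part is not the comparison but having a store that keeps months of per-metric history at full precision, which is exactly where RRDTool falls down.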
>
> In combination with tools like Graphite for graphing, Shopzilla is able to
produce useful and extremely accurate graphs that allow them to see at a
glance that they've got a problem, and the probable cause (or, as is most
likely, causes).  They're even presenting information in such readable
formats that their non-technical staff are able to see and understand what
it means for their side of the business.
>
> In my experience Nagios is incapable of scaling up to that level, and
can't do even a fraction of what that allows (I'm not sure I've ever used a
monitoring system that can), but it's a very real-world need for monitoring.
 The more we know, the better we can track down problems (provided the data
is presented sensibly; too much of it can confuse an issue).  If Shopzilla's
in-house solution can handle that scale of monitoring, it should scale down
very nicely too.  For all the fancy stuff around it, it all boils down to
gathering information, then storing, presenting, and alerting on it, exactly
the same as any other monitoring solution.  How much duplication of effort
has there been in building similar systems in other environments?
>
> We can do better, and companies are proving it with their own in-house
systems.  Now is a good chance to bring together what has been learnt there
and elsewhere and see what we can make.
>
> Paul
>
>
>
> On 07/22/2011 10:33 AM, Joseph Kern wrote:
>>
>> Funny ... I am just sitting here configuring Nagios, marveling at how
much power there is in an object-oriented template system, and wondering why
it isn't used more ...
>>
>> Adam's xkcd comic had me laughing when it was first posted; now it has me
cringing.
>> Tom's mention of the four ponies of the monitoring apocalypse is a great
starting point.
>>
>> So ... what is going to be different from Zenoss, MRTG, Nagios, MS-SCOM,
HP OpenView, etc.?  I've used them all ... and the only one I complained
about was MS-SCOM (although it DID have a few nice features).
>>
>> The monitoring market has high table stakes. What are you going to do
that can't be implemented by a large organization that already has a
monitoring product?
>>
>>
>> On Fri, Jul 22, 2011 at 3:58 PM, Paul Graydon <[email protected]> wrote:
>>
>>    On 07/22/2011 09:16 AM, Robert Hajime Lanning wrote:
>>
>>        On 07/22/11 09:44, Paul Graydon wrote:
>>
>>            On 7/22/2011 2:29 AM, Adam Moskowitz wrote:
>>
>>                Paul Graydon wrote:
>>
>>                    Hopefully with a good wide spread of interest and
>>                    talents we could
>>                    finally get a monitoring tool that doesn't
>>                    actually suck!
>>
>>                And what color pony do you want with that?
>>
>>                Seriously, given the incredibly wide range of
>>                applications, situations,
>>                SLAs, services, constraints, conditions, and
>>                requirements, I think the
>>                idea that a single tool will solve everyone's problems
>>                is, well, nothing
>>                short of ludicrous.
>>
>>            By making /everything/ modular and extensible, and
>>            having the monitoring platform be a framework into
>>            which individual components natively plug: everything
>>            from data collection to presentation, reporting, and
>>            responding.  That's what the proposal seems to boil
>>            down to.  It's something we're sadly lacking in most
>>            monitoring solutions I've ever seen.  It's almost
>>            entirely 'their way or the highway', with a few
>>            bolt-ons on the side, fudged into place just to get by
>>            (with all the unreliability and risk that implies).
>>
>>        Then you end up with HP OpenView...
>>        ugh
>>
>>    So help them make it not HP OpenView.  Point out the mistakes made
>>    with that platform, what it's good at and what it's bad at.
>>    They're at the very initial design stages, not implementation, and
>>    so now is the time to help ensure what they produce goes the right
>>    way.
>>
>>    It's rare to get a chance to influence a product at this stage;
>>    usually by the time people start really talking, the initial
>>    implementation is already done (along with what may be bad design
>>    decisions).  Most of these solutions come out of something coded
>>    to meet a business's specific needs, not a bunch of people across
>>    a number of different businesses and environments collaborating.
>>
>>    What we've got here are a bunch of dedicated and talented
>>    programmers and operations people motivated to solve a real
>>    problem, and not only willing but enthusiastic about spending
>>    their spare time on it.  We'd be utter fools not to capitalise on
>>    that.  We can either sit here and moan about how bad an idea this
>>    is and 3 years down the line be proven correct as yet another
>>    product fails to meet the real operations needs, or participate
>>    and help to make something that makes a serious attempt to fix a
>>    very real and significant problem, and maybe, just maybe, 3 years
>>    down the line find you've got something of use.
>>
>>    Paul
>>
>>    _______________________________________________
>>    Discuss mailing list
>>    [email protected] <mailto:[email protected]>
>>
>>    https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
>>    This list provided by the League of Professional System Administrators
>>    http://lopsa.org/
>>
>>
>