That is certainly food for thought, Paul. But how will you abstract application-based monitoring and reporting? From your example it seems that the value is derived from knowledge of the application methods and development stack (including Java runtime idiosyncrasies).
How will I benefit if I am running a Python or Ruby application?

On Jul 22, 2011 6:17 PM, "Paul Graydon" <[email protected]> wrote:
>
> Nagios is rather limited, especially when it comes to trend analysis and bigger-picture stuff. No man is an island, but as far as Nagios is concerned, mostly every item is. It's focussed on noting what's happening in the here and now with a particular thing, which is great for alerting you to problems, but not so good for retrospective or broad-spectrum analysis. When something is alerting, can you relatively quickly and easily see trend data for related metrics? E.g. a web site is being slow: can you see Apache requests per second, disk IO, database queries per second, network latency etc., not just at that particular moment in time but in the run-up to it? What is your root cause?
> Most graphing plugins for Nagios rely on RRDTool, which is great but also has problems with precision. To get trend analysis you need to integrate tools like SmokePing or Cacti, neither of which was designed to operate with Nagios, so you're glueing two disparate systems together. Nagios doesn't scale that well (nor does Cacti, from personal experience) and you can often find yourself needing to spread out the workload amongst a number of boxes (particularly when you're ISP scale+). Unfortunately there is no integrated mechanism for doing so (Zabbix provides easily configurable proxy servers, but falls short in other areas). If you're running config management you could push that into the logic for the config management tool, or maybe use tools like DMX to help.
> Other tools even take a step away from config files and look to auto-detection and configuration.
>
> Here's a real-world example of monitoring, based on something Shopzilla does in house. They're a site which sees over 19m uniques a month.
> They express a number of MBeans (metrics) over JMX from their production and test environments, for example how long it took to process a page request, down to how long the foo-bar function took to run. Alongside that is the big-picture stuff like stack heap sizes, garbage collection runs, nursery size etc., and alongside that, from outside of the Java instances, they're looking at server resources, revenue figures, all the business logic stuff. All this data is collected in real time from their applications. By piecing it all together they can see that every additional second taken for certain actions equates to $x loss in revenue (they make money based on click-throughs).
>
> Nagios will tell me if I hit a threshold or not; it won't tell me that just slightly higher load on this, that and the other metric is going to result in us missing revenue targets for the hour. It doesn't see the bigger picture, or it involves a fair bit of work and manual configuration to enable. There is no way to say "Let me know if CPU usage of this application is x% higher than usual for this day of the week on this week of the month, compared to the last 6 months".
>
> In combination with tools like Graphite for graphing, Shopzilla are able to produce useful and extremely accurate graphs that allow them to see at a glance that they've got a problem, and the probable cause or, as is most likely, causes. They're even presenting information in such readable formats that their non-technical staff are able to see and understand what it means to their aspect of the business.
>
> In my experience Nagios is incapable of scaling up to that level, and can't do even a fraction of what that allows (I'm not sure I've ever used a monitoring system that can), but it's a very real-world need for monitoring. The more we know, the better we can track down problems (provided it's presented sensibly; too much data can confuse an issue).
> If Shopzilla's in-house solution can handle that scale of monitoring, it should scale down very nicely too. For all the fancy stuff around it, it all boils down to gathering information, storing, presenting and alerting, exactly the same as any other monitoring solution. How much duplication of effort has there been in building similar systems in other environments?
>
> We can do better, and companies are proving it with their own in-house systems. Now is a good chance to bring what has been learnt there and elsewhere and see what we can make.
>
> Paul
>
>
>
> On 07/22/2011 10:33 AM, Joseph Kern wrote:
>>
>> Funny ... I am just sitting here configuring Nagios, and marveling at how much power there is in an object-oriented template system and wondering why it isn't used more ...
>>
>> Adam's xkcd comic had me laughing when it was first posted, now it has me cringing.
>> Tom's mention of the four ponies of the monitoring apocalypse is a great starting point.
>>
>> So ... what is going to be different than Zenoss, MRTG, Nagios, MS-SCOM, HP OpenView, etc.? I've used them all ... and the only one I complained about was MS-SCOM (although it DID have a few nice features).
>>
>> The monitoring market has high table stakes. What are you going to do that can't be implemented by a large organization that already has a monitoring product?
>>
>>
>> On Fri, Jul 22, 2011 at 3:58 PM, Paul Graydon <[email protected]> wrote:
>>
>> On 07/22/2011 09:16 AM, Robert Hajime Lanning wrote:
>>
>> On 07/22/11 09:44, Paul Graydon wrote:
>>
>> On 7/22/2011 2:29 AM, Adam Moskowitz wrote:
>>
>> Paul Graydon wrote:
>>
>> Hopefully with a good wide spread of interest and talents we could finally get a monitoring tool that doesn't actually suck!
>>
>> And what color pony do you want with that?
>>
>> Seriously, given the incredibly wide range of applications, situations, SLAs, services, constraints, conditions, and requirements, I think the idea that a single tool will solve everyone's problems is, well, nothing short of ludicrous.
>>
>> By making /everything/ modular and extensible, and having the monitoring platform be a framework which individual components are natively plugged in to, everything from data collection to presentation, reporting or responding. That's what the proposal seems to boil down to. It's something we're sadly lacking with most monitoring solutions that I've ever seen. It's almost entirely 'their way or the highway', with a few bolt-ons on the side, fudged into place just to get by (with all the unreliability and risk that implies).
>>
>> Then you end up with HP OpenView...
>> ugh
>>
>> So help them make it not HP OpenView. Point out the mistakes made with that platform, what it's good at and what it's bad at. They're at the very initial design stages, not implementation, and so now is the time to help ensure what they produce goes the right way.
>>
>> It's rare to get a chance to influence a product in these stages; usually by the time people start really talking the initial implementation is already done (along with what may be bad design decisions). Most of these solutions come out of something coded to meet a business's specific needs, not a bunch of people across a number of different businesses and environments collaborating.
>>
>> What we've got here are a bunch of dedicated and talented programmers and operations people motivated to solve a real problem, and not only willing but enthusiastic about spending their spare time on it. We'd be utter fools not to capitalise on that.
>> We can either sit here and moan about how bad an idea this is and 3 years down the line be proven correct as yet another product fails to meet the real operations needs, or participate and help to make something that makes a serious attempt to fix a very real and significant problem, and maybe, just maybe, 3 years down the line find you've got something of use.
>>
>> Paul
>>
>> _______________________________________________
>> Discuss mailing list
>> [email protected]
>>
>> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
>> This list provided by the League of Professional System Administrators
>> http://lopsa.org/
>>
>>
>
