It'd be even cooler if your monitor could learn a virtual machine's "normal" or "expected" activity pattern by time of day / day of week and then signal things out of the ordinary. Like the batch activity that was supposed to have been running but took an unexpected low address protection exception and CPU dived to 0.5%, or the online server whose new code release put it into an occasional loop and chewed an engine for a while. (Real-world examples from, oh, the last 3 weeks :)
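To make the idea concrete, here is a rough sketch of what "learning normal by time of day / day of week" could look like. It is not taken from any existing monitor: the class name `UsageBaseline`, its parameters, and the sample numbers are all made up for illustration, and a real product would need far more care (holidays, month-end runs, trends).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


class UsageBaseline:
    """Hypothetical per-guest CPU baseline, one bucket per (weekday, hour)."""

    def __init__(self, min_samples=4, tolerance=3.0, floor=5.0):
        self.samples = defaultdict(list)
        self.min_samples = min_samples  # history needed before judging a slot
        self.tolerance = tolerance      # allowed deviation, in standard deviations
        self.floor = floor              # minimum allowed deviation, in CPU points

    @staticmethod
    def _slot(when):
        return (when.weekday(), when.hour)

    def learn(self, guest, when, cpu_pct):
        """Record one observed CPU percentage for this guest and time slot."""
        self.samples[(guest, self._slot(when))].append(cpu_pct)

    def check(self, guest, when, cpu_pct):
        """Return 'unknown', 'normal', 'low' or 'high' for a new observation."""
        history = self.samples.get((guest, self._slot(when)), [])
        if len(history) < self.min_samples:
            return "unknown"
        expected = mean(history)
        allowed = max(self.tolerance * pstdev(history), self.floor)
        if cpu_pct < expected - allowed:
            return "low"
        if cpu_pct > expected + allowed:
            return "high"
        return "normal"


# Assumed data: a batch guest that is busy every Thursday at 02:00.
baseline = UsageBaseline()
for day, cpu in [(22, 95.0), (29, 97.0)]:
    baseline.learn("BATCH01", datetime(2010, 7, day, 2, 0), cpu)
for day, cpu in [(5, 96.0), (12, 98.0)]:
    baseline.learn("BATCH01", datetime(2010, 8, day, 2, 0), cpu)

# The batch died and CPU dropped to 0.5% when it should be near 100%.
print(baseline.check("BATCH01", datetime(2010, 8, 19, 2, 0), 0.5))
# The batch is running flat out, which is what this slot normally looks like.
print(baseline.check("BATCH01", datetime(2010, 8, 19, 2, 0), 100.0))
```

With a baseline like this, the same 100% reading that is routine during the batch window would only stand out in a slot where the guest is normally quiet, and a guest that goes quiet when it should be busy gets noticed too.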
The business of triggering on error messages is always a reactive thing. You get a message, you have a big problem because a bad message went unnoticed for hours and something on down the line failed, and people play cleanup. You add paging automation around that message for the next time...

All of this systems automation software could be a lot smarter...

Marcy

-----Original Message-----
From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Rich Smrcina
Sent: Thursday, August 19, 2010 4:39 PM
To: [email protected]
Subject: Re: [LINUX-390] How to convince others. Was: Re: mono keep guest active - ban the blips.

If your batch runs regularly or consistently drive some virtual machines to 100%, this may not signal a loop condition (which, I would guess, is why the ticket is being raised). Techs may grow conditioned to this and either take longer to respond or just outright 'ignore' the tickets eventually, since the 'normal' course of action is to page for a condition that is unresolvable without a larger share, or redistribution of the load.

If only the monitor could 'know' that the machine was running this batch load at a certain time of day and had an absolute share and was running 100% for an extended period of time. It could be set up to not send out alerts based on all of these criteria. Wow! That would be a very nice feature.

When your monitoring department looks at top, vmstat and sar to detect problems, don't forget the kernel numbers lie. Even the new steal timer is a little off.

On 08/19/2010 05:51 PM, Berry van Sleeuwen wrote:
> True, it isn't. It's the replacement of an operator. The main issue here
> is that it needs to raise tickets and get reporting stats. For instance,
> raise a ticket at 100% CPU (and indeed, our ABS limithard machines do
> raise tickets when they are running their batch..<sigh>.) or when a
> filesystem is at 100%. The reporting is for instance on CPU and
> filesystem usage.
>
> But indeed it can't provide insight in the performance of a guest, other
> than detect thresholds. And it doesn't have to either, the monitoring
> department can look at top, vmstat or sar to detect that kind of
> problems should they need to (yeah right, then they know all about the
> entire environment).
>
> Still, as for a case, this is a good point. We need to be able to
> address performance related monitoring and nagios can't do that. Or at
> least not within the scope of an entire LPAR.
>
> Thanks, Berry.
>

--
Rich Smrcina
Phone: 414-491-6001
http://www.linkedin.com/in/richsmrcina

Catch the WAVV! http://www.wavv.org
WAVV 2011 - April 15-19, 2011 Colorado Springs, CO

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z, visit
http://wiki.linuxvm.org/
