Thanks to everyone for their responses! Fortunately, I don't think anyone here thinks that we can just do a drop-in replacement for Hyperic. There are so many metrics that you can monitor, and it's highly unlikely that they'll be configured in the same way (polling intervals, alert thresholds, etc.) as in a previous tool. Combine that with alerting schedules, groups, etc., and it's a substantial task.

On the alerting end, we're slowly moving towards getting everyone in our department to use OpsGenie. The truth is that you're never going to have just one monitoring tool these days - I think we have ten or so in various states of use - so you need something that can aggregate all of your alerts and notify you in a uniform fashion. It's way easier to have your monitoring software say "This is a priority one alert," then ship it off to OpsGenie (PagerDuty, etc.), and let that worry about who's on call, how people should be notified, etc. It also allows people to choose how they'd like to be notified (SMS, e-mail, app, phone call, etc.). No way we could ever go back--it's just such an awesome tool.

Patrick - I actually remember you giving a lightning talk on Sensu a couple of years back. It's got good community support, so it's one of the things I'm going to demo in the next few days.

Does anyone have experience with LogicMonitor, New Relic, Scout, Librato, Traverse, etc.? I could see those tools being a very good fit for us, provided that metrics are collected in a sane fashion (we don't need hundreds of devices trying to report off site) and the billing model is flexible (we want our VMs now, thank you very much). We've also got strong in-house Windows experience: has anyone used SCOM to monitor Linux devices?

John

On Thu, Feb 4, 2016 at 8:38 AM, Antony Rudie <[email protected]> wrote:
> That last sentence is god's own truth.
>
> One more thing you need: A solid plan for what happens to the alerts.
> Where do they go? Who deals with them? How do you know if they've been
> dealt with? It's not rocket science, but it's really important. I worked
> in a place where that piece was missing, and frankly, I'm not sure the whole
> monitoring setup added any value.
>
> On Wed, Feb 3, 2016 at 7:52 PM, John Stoffel <[email protected]> wrote:
>>
>>
>> One of the things that people seem to miss, or overlook in my opinion
>> is the cost of doing all this monitoring, and the steep learning curve
>> you have for all of it. It's going to suck in a bunch of time at
>> first, way more than people think, and getting it tuned so that it's
>> not sending out false alarms is a huge task.
>>
>> I've played with Nagios, and we have Solarwinds at $WORK, but neither
>> is well done or really used outside of silos. I also played around
>> with collectd and graphite, but found it too simplistic in terms of
>> access control for what I/we wanted. And we have an old instance of
>> WhatsUp running as well for another group. It's all hodgepodge. We
>> really should dedicate someone to doing this work, but we all keep
>> getting pulled in new directions all the time.
>>
>> It might be easier if you're just upgrading from something and you
>> know what you want to monitor, etc. But it's not a simple drop-in
>> tool like some people make out. It requires commitment and discipline
>> to use effectively.
>>
>> John
>>
>> _______________________________________________
>> bblisa mailing list
>> [email protected]
>> http://www.bblisa.org/mailman/listinfo/bblisa
>
>

--
John Miller
Systems Engineer
Brandeis University
[email protected]
(781) 736-4619
