I am the operator of i-0000040c.pmtpa.wmflabs and i-0000040b.pmtpa.wmflabs.
I am not sure why they are showing as offline, because they are not.

~Jason

On Sun, Sep 30, 2012 at 2:44 PM, 208.97.132.231 <[email protected]> wrote:
> Complaining bit
> ==========
> Once again I'm trying to clear up monitoring so we can improve it. The
> following instances are currently reporting as down (some have been for
> quite a while);
>
> * test5 - i-00000026.eqiad.wmflabs
> ** Ryan - ping/nrpe is restricted from pmtpa to eqiad, is this intended?
> Do we want 1 Nagios instance per Region or centralized monitoring? Not an
> issue currently but needs deciding/sorting before bringing the region
> online.
> * wlm-mysql-master - i-0000040c.pmtpa.wmflabs
> * wep - i-000000c2.pmtpa.wmflabs
> * analytics - i-000000e2.pmtpa.wmflabs
> * deployment-backup - i-000000f8.pmtpa.wmflabs
> * deployment-feed - i-00000118.pmtpa.wmflabs
> * configtest-main - i-000002dd.pmtpa.wmflabs
> * deployment-cache-bits02 - i-0000031c.pmtpa.wmflabs
> * puppet-abogott - i-00000389.pmtpa.wmflabs
> * mobile-wlm2 - i-0000038e.pmtpa.wmflabs
> * conventionextension-test - i-000003c0.pmtpa.wmflabs
> * lynwood - i-000003e5.pmtpa.wmflabs
> * wlm-apache1 - i-0000040b.pmtpa.wmflabs
>
> If any of these are yours could you either;
> a) Reply, if it's still pending file recovery from the block storage
> migration (these should all be done).
> b) Delete, if it's not used and has no plan for being used.
> c) Start, if its purpose is to be used and is just stopped from the last
> outage (block storage migration).
> d) Reply, if it needs to be down for some reason (I'll mark it as such in
> monitoring, so it doesn't spam the channel)
> e) Reply, if it's online and functioning as expected (there might be a
> security group etc issue)
>
> Current problems with monitoring
> =====================
> While monitoring everything based on puppet classes makes perfect sense
> for production, currently because most things are not packaged/puppetized
> and we're half dev and half 'semi-production', monitoring rather sucks.
>
> Due to the current state of labs I suggest that we add an attribute to
> instance entries in LDAP that allows monitoring to be enabled and default
> to not monitoring.
>
> Now while that may seem silly, currently we can't really enable the relay
> bot without flooding the channel with nonsense, which makes monitoring
> redundant.
>
> By actually limiting spam to things we care about (public http instances,
> mysql servers etc) we can easily see when things are actually breaking.
>
> Downsides
> ---------
> We lose a general overview of instances, which causes a more reactive
> approach - however we're not so proactive currently.
>
> Implementation choices
> ===============
> a) Based on puppet classes (current usage)
> Pros;
> * Monitoring is standard
> * Monitoring is automatic
>
> Cons;
> * We're supposed to be developing, not standardizing (at this point)
> * Important services get masked by dev instances
> * We're not really monitoring services (they're not puppetized, yet)
>
> b) Based on user input (possibly stored in ldap as an entry under the host)
> Pros;
> * People can test/develop monitoring
> * We can monitor things not yet puppetized
> * We can ignore unimportant things
>
> Cons;
> * Monitoring isn't standard
> * Monitoring isn't automatic
> * We're breaking from production in style
>
> While I'd love to spend my time convincing people using puppet is the way
> forward, quite frankly the current state of the repo is a mess.
> It's partly not usable in labs AFAIK (due to the way parameters are
> handled) and is generally a mix of bad/confusing code that's whitespace
> hell.
>
> As we move over to role classes with parametrized classes in modules it
> should be easier and quicker to get changes in.
>
> Until there is a push, monitoring is either mostly redundant or we can work
> on improving it. As we have semi-production stuff I think we should improve
> it.
>
> The question then becomes whether we want to enable user-based monitoring
> and treat nagios as a dev environment alongside puppet classes, keep puppet
> classes exclusively, use user input exclusively, or split the usage in two
> and have puppet-based for 'production' services and user-based for dev.
>
> It would be 'easy' to allow 'extra' monitoring data to be specified on an
> instance's subpage, or even bang it in LDAP - however this could encourage a
> path that we don't want.
>
> Features I'd like to see
> ==============
> * User access to the web interface (ldap authenticated, based on project
> membership)
> * More extensive monitoring of services (think about the beta project and
> how crappy the monitoring for it is currently)
> * Optional subscription to alerts on a per-project basis (think about
> semi-production stuff where it would be nice to get an email saying it's
> borked)
> * Puppetization of the setup
> * Expansion of the groups/templates to include everything in puppet that's
> monitored in production (currently it's a very small common list).
> * Grouping based on region
> * Grouping based on host (this is currently exposed via labsconsole, we
> could scrape it for info or talk to nova directly I guess.
> Harder than the above)
> * A bot that doesn't die randomly
> * A way to shard monitoring (per region) for when we get so many instances
> it's not possible to have a single crappy box
>
> Features I'd be interested in exploring
> =======================
> * Using saltstack to grab monitoring info (for example puppet last run
> time, this can be calculated from a state file and pushed back to the
> monitoring instance or polled using minion to minion salt access). SNMP
> traps kinda suck, rely on people updating their puppet clones etc. Without
> adding sudo access to nrpe and writing a script for it there's no other way
> to get root level access to grab the file data. Extending saltstack (if we
> do end up using it widely) and creating a 'feedback loop' would be nice.
> * Being able to monitor misc data/servers (think labsconsole - currently
> things like controllers are monitored on production Nagios, this data isn't
> however relayed to #wikimedia-labs or widely open to the labs community).
> While monitoring infrastructure from within itself isn't a good idea
> generally, from a centralized community point of view it might be nice.
> * Adding other software (Graphite) to the 'common use' 'monitoring stack'.
> For example in bots it would be nice to a) monitor the processes/random
> data in nagios but also b) push metrics out and have historical graphs.
> Downside is graphite isn't currently packaged for ubuntu in public repos,
> it is somewhere for prod though. Also would need some form of proxy to
> determine project name prefix for data coming in.
> * Adding a real api to labsconsole to expose the data we have in there as
> well as allowing the creation/configuration and deletion of instances. JSON
> output of SMW searches rather sucks a little due to the filtering etc.
> * Exposing current status/uptime stats per project and instance on
> labsconsole (not sure how easy it would be to transclude this/images from
> ganglia).
> The instance pages are mostly useless and uninteresting to look at. For
> example on the beta project it would be interesting to be able to say
> 'it's been up 99.98% this month with a response time of xxxms'. With
> data we can at least have an idea when things are going crappy rather than
> 'it's broken', 'now it's not'.
>
> TL;DR
> Our monitoring currently sucks, we need to get to a place where rolling
> out a cluster based on puppet classes gets auto monitored but also allows
> development without masking useful alerts.
>
> I'm not too sure on the perfect solution right now, however I'd love some
> feedback/ideas from everyone else and to publicise what monitoring we do
> have generally.
>
> Damian
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>

--
*Thank you for contacting Jason Spriggs.*
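
The opt-in scheme proposed in the quoted message (a per-instance LDAP attribute that enables monitoring, with monitoring off by default) could be sketched roughly as below. This is only an illustration under stated assumptions: the attribute name `monitoringEnabled`, the `host` key, and the dict-shaped entries are made up here and are not anything defined in the labs LDAP schema. No LDAP library is used; the entries simply mimic the attribute-to-list-of-values shape that LDAP searches usually return.

```python
from typing import Dict, Iterable, List

# Hypothetical attribute name: the proposal says "an attribute" but does not name one.
MONITORING_ATTR = "monitoringEnabled"

Entry = Dict[str, List[str]]


def is_monitored(entry: Entry) -> bool:
    """Return True only when the entry explicitly opts in.

    A missing attribute means "not monitored", which is the default
    suggested in the quoted proposal.
    """
    values = entry.get(MONITORING_ATTR, [])
    return any(value.strip().lower() == "true" for value in values)


def monitored_hosts(entries: Iterable[Entry]) -> List[str]:
    """Collect the host names of all entries that opted in to monitoring."""
    # "host" is an assumed key used for illustration only.
    return [entry["host"][0] for entry in entries if is_monitored(entry)]


# Sample entries shaped like LDAP results (attribute -> list of values).
entries = [
    {"host": ["i-0000040c.pmtpa.wmflabs"], MONITORING_ATTR: ["TRUE"]},
    {"host": ["i-0000040b.pmtpa.wmflabs"]},  # no attribute: not monitored
    {"host": ["i-000000c2.pmtpa.wmflabs"], MONITORING_ATTR: ["FALSE"]},
]

print(monitored_hosts(entries))
```

Only hosts whose entry carries an explicit true value end up in the list, so dev instances stay out of the relay bot's channel unless their owner switches monitoring on.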
