[Labs-l] Down instances & proposed Nagios changes/issues

208.97.132.231 Sun, 30 Sep 2012 11:44:42 -0700

Complaining bit
==========

Once again I'm trying to clear up monitoring so we can improve it. The following instances are currently reporting as down (some have been for quite a while):


* test5 - i-00000026.eqiad.wmflabs

** Ryan - ping/nrpe is restricted from pmtpa to eqiad, is this intended? Do we want one Nagios instance per region or centralized monitoring? Not an issue currently, but it needs deciding/sorting before bringing the region online.

* wlm-mysql-master - i-0000040c.pmtpa.wmflabs
* wep - i-000000c2.pmtpa.wmflabs
* analytics - i-000000e2.pmtpa.wmflabs
* deployment-backup - i-000000f8.pmtpa.wmflabs
* deployment-feed - i-00000118.pmtpa.wmflabs
* configtest-main - i-000002dd.pmtpa.wmflabs
* deployment-cache-bits02 - i-0000031c.pmtpa.wmflabs
* puppet-abogott - i-00000389.pmtpa.wmflabs
* mobile-wlm2 - i-0000038e.pmtpa.wmflabs
* conventionextension-test - i-000003c0.pmtpa.wmflabs
* lynwood - i-000003e5.pmtpa.wmflabs
* wlm-apache1 - i-0000040b.pmtpa.wmflabs

If any of these are yours, could you either:

a) Reply, if it's still pending file recovery from the block storage migration (these should all be done).

b) Delete, if it's not used and has no plan for being used.

c) Start, if its purpose is to be used and is just stopped from the last outage (block storage migration).

d) Reply, if it needs to be down for some reason (I'll mark it as such in monitoring, so it doesn't spam the channel).

e) Reply, if it's online and functioning as expected (there might be a security group etc issue).


Current problems with monitoring
=====================

While monitoring everything based on puppet classes makes perfect sense for production, currently, because most things are not packaged/puppetized and we're half dev and half 'semi-production', monitoring rather sucks.

Due to the current state of labs I suggest that we add an attribute to instance entries in LDAP that allows monitoring to be enabled, and default to not monitoring.

Now while that may seem silly, currently we can't really enable the relay bot without flooding the channel with nonsense, which makes monitoring redundant.

By limiting spam to things we actually care about (public http instances, mysql servers etc) we can easily see when things are really breaking.
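
To make the opt-in idea above concrete, here is a minimal sketch of the "default to not monitoring" behaviour. It assumes instance entries have already been read from LDAP into plain dicts of attribute name to list of values. The attribute name `monitoringEnabled` is hypothetical, it is not part of any existing schema, and the host names are only used as sample data.

```python
from typing import Dict, List

# Hypothetical attribute name (an assumption, not an existing LDAP schema field).
MONITOR_ATTR = "monitoringEnabled"

# One LDAP-style entry: attribute name -> list of string values.
Entry = Dict[str, List[str]]


def is_monitored(entry: Entry) -> bool:
    """Return True only if the entry explicitly opts in to monitoring.

    A missing attribute, an empty value list, or any value other than
    "true" (case-insensitive) all mean "not monitored".
    """
    values = entry.get(MONITOR_ATTR, [])
    return any(v.strip().lower() == "true" for v in values)


def monitored_hosts(entries: Dict[str, Entry]) -> List[str]:
    """Return the sorted host names whose entries opt in to monitoring."""
    return sorted(host for host, entry in entries.items() if is_monitored(entry))


if __name__ == "__main__":
    sample = {
        "i-0000040c.pmtpa.wmflabs": {MONITOR_ATTR: ["TRUE"]},
        "i-000000c2.pmtpa.wmflabs": {},  # attribute absent: not monitored
        "i-000000e2.pmtpa.wmflabs": {MONITOR_ATTR: ["FALSE"]},
    }
    print(monitored_hosts(sample))
```

The point of defaulting to off is that a forgotten dev instance never reaches the relay bot unless somebody deliberately flags it.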


Downsides
---------

We lose a general overview of instances, which causes a more reactive approach - however we're not so proactive currently.


Implementation choices
===============
a) Based on puppet classes (current usage)
Pros;
* Monitoring is standard
* Monitoring is automatic

Cons;
* We're supposed to be developing, not standardizing (at this point)
* Important services get masked by dev instances
* We're not really monitoring services (they're not puppetized, yet)

b) Based on user input (possibly stored in ldap as an entry under the host)
Pros;
* People can test/develop monitoring
* We can monitor things not yet puppetized
* We can ignore unimportant things

Cons;
* Monitoring isn't standard
* Monitoring isn't automatic
* We're breaking from production in style

While I'd love to spend my time convincing people that using puppet is the way forward, quite frankly the current state of the repo is a mess. It's partly not usable in labs AFAIK (due to the way parameters are handled) and is generally a mix of bad/confusing code that's whitespace hell.

As we move over to role classes with parametrized classes in modules it should be easier and quicker to get changes in.

Until there is a push, monitoring is either mostly redundant or we can work on improving it. As we have semi-production stuff I think we should improve it.

The issue becomes whether we want to enable user based monitoring and treat nagios as a dev environment alongside puppet classes, keep puppet classes exclusively, use user input exclusively, or split the usage in two and have puppet based monitoring for 'production' services and user based for dev.

It would be 'easy' to allow 'extra' monitoring data to be specified on an instance's subpage, or even bang it in LDAP - however this could encourage a path that we don't want.
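
As a rough illustration of the split option (puppet based checks for 'production' services plus user supplied extras for dev), the sketch below merges the two sources into one check list per instance. The class names and check names in `CLASS_CHECKS` are made up for the example, and this is not how the current Nagios setup builds its config.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from puppet class to the checks it implies.
# The names are illustrative only, not taken from the real repo.
CLASS_CHECKS: Dict[str, List[str]] = {
    "role::webserver": ["ping", "http"],
    "role::mysql": ["ping", "mysql"],
}


def checks_for(puppet_classes: Iterable[str], user_checks: Iterable[str]) -> List[str]:
    """Combine puppet-derived checks with user-specified extras.

    Puppet-derived checks come first, unknown classes contribute nothing,
    and duplicates are dropped while keeping first-seen order.
    """
    combined: List[str] = []
    for cls in puppet_classes:
        combined.extend(CLASS_CHECKS.get(cls, []))
    combined.extend(user_checks)
    # dict.fromkeys preserves insertion order, so this de-duplicates in place.
    return list(dict.fromkeys(combined))


if __name__ == "__main__":
    print(checks_for(["role::webserver"], ["http", "bot_process"]))
```

Keeping the user supplied part as a plain list is what would let people test checks for things that are not puppetized yet.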


Features I'd like to see
==============

* User access to the web interface (ldap authenticated, based on project membership)

* More extensive monitoring of services (think about the beta project and how crappy the monitoring for it is currently)

* Optional subscription of alerts on a per project basis (think about semi-production stuff where it would be nice to get an email saying it's borked)

* Puppetization of the setup

* Expansion of the groups/templates to include everything in puppet that's monitored in production (currently it's a very small common list).

* Grouping based on region

* Grouping based on host (this is currently exposed via labsconsole, we could scrape it for info or talk to nova directly I guess. Harder than the above)

* A bot that doesn't die randomly

* A way to shard monitoring (per region) for when we get so many instances it's not possible to have a single crappy box
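
For the region grouping and sharding items, the region is already embedded in the instance names listed at the top (for example i-00000026.eqiad.wmflabs). A minimal sketch, assuming every name follows the `<instance-id>.<region>.wmflabs` pattern seen in that list; the function names are mine, not part of any existing tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def region_of(fqdn: str) -> str:
    """Extract the region label from a name like 'i-00000026.eqiad.wmflabs'."""
    parts = fqdn.split(".")
    if len(parts) != 3 or parts[2] != "wmflabs" or not parts[1]:
        raise ValueError(f"unexpected instance name: {fqdn!r}")
    return parts[1]


def group_by_region(fqdns: Iterable[str]) -> Dict[str, List[str]]:
    """Group instance names by region, keeping input order within each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for fqdn in fqdns:
        groups[region_of(fqdn)].append(fqdn)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "i-00000026.eqiad.wmflabs",
        "i-0000040c.pmtpa.wmflabs",
        "i-000000c2.pmtpa.wmflabs",
    ]
    for region, hosts in group_by_region(names).items():
        print(region, hosts)
```

Each group could then become a Nagios hostgroup, or the host list for a per region monitoring box if we do end up sharding.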


Features I'd be interested in exploring
=======================

* Using saltstack to grab monitoring info (for example puppet last run time, this can be calculated from a state file and pushed back to the monitoring instance or polled using minion to minion salt access). SNMP traps kinda suck, rely on people updating their puppet clones etc. Without adding sudo access to nrpe and writing a script for it there's no other way to get root level access to grab the file data. Extending saltstack (if we do end up using it widely) and creating a 'feedback loop' would be nice.

* Being able to monitor misc data/servers (think labsconsole - currently things like controllers are monitored on production Nagios, this data isn't however relayed to #wikimedia-labs or widely open to the labs community). While monitoring infrastructure from within itself isn't a good idea generally, from a centralized community point of view it might be nice.

* Adding other software (Graphite) to the 'common use' 'monitoring stack'. For example in bots it would be nice to a) monitor the processes/random data in nagios but also b) push metrics out and have historical graphs. Downside is graphite isn't currently packaged for ubuntu in public repos, it is somewhere for prod though. Also would need some form of proxy to determine project name prefix for data coming in.

* Adding a real api to labsconsole to expose the data we have in there as well as allowing the creation/configuration and deletion of instances. JSON output of SMW searches rather sucks a little due to the filtering etc.

* Exposing current status/uptime stats per project and instance on labsconsole (not sure how easy it would be to transclude this/images from ganglia). The instance pages are mostly useless and uninteresting to look at. For example on the beta project it would be interesting to be able to say 'it's been up 99.98% this month with a response time of xxxms'. With data we can at least have an idea when things are going crappy rather than 'it's broken', 'now it's not'.
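
To show what the uptime stat in the last item would involve, here is a minimal sketch that turns a month of check results into a figure like 'up 99.98%'. It assumes checks run at a fixed interval, so each result carries equal weight, and that results are available as a simple list of booleans. Neither assumption comes from the current Nagios setup.

```python
from typing import Iterable


def availability(results: Iterable[bool]) -> float:
    """Return the percentage of checks that passed.

    `results` holds one boolean per check (True = up), sampled at a fixed
    interval so that every sample represents the same amount of time.
    """
    samples = list(results)
    if not samples:
        raise ValueError("no check results to summarise")
    passed = sum(1 for ok in samples if ok)
    return 100.0 * passed / len(samples)


def summary(results: Iterable[bool]) -> str:
    """Format the availability the way it might be shown on an instance page."""
    return f"it's been up {availability(results):.2f}% this month"


if __name__ == "__main__":
    # 10,000 checks with 2 failures.
    month = [True] * 9998 + [False] * 2
    print(summary(month))
```

Response time would need the check latency stored alongside each result, which this sketch leaves out.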


TL;DR

Our monitoring currently sucks, we need to get to a place where rolling out a cluster based on puppet classes gets auto monitored but also allow development without masking useful alerts.

I'm not too sure on the perfect solution right now, however I'd love some feedback/ideas from everyone else and to publicise what monitoring we do have generally.


Damian

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
