Rush has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/335373 )
Change subject: wip nodepool: track and alert on age of instance states
......................................................................
wip nodepool: track and alert on age of instance states
'nodepool list' natively tracks how long an instance has been
held in its current state. Nodepool keeps its own state table,
informed by nova, to provide this information. In theory, recent
issues would have been caught by alerting, had we defined a valid
expected threshold for instances in each of these states:
* building (time to spin up)
* delete (time to clean up an instance)
* active (time from build to 'ready')
* used (time an instance is out running a test)
Change-Id: Ifd28b6e15309efe13d69054c77ae190f741acb8a
Bugs: T156636
---
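As a rough illustration of the check described in the commit message, here is a minimal Python 3 sketch. It assumes the instances have already been reduced to (state, age in hours) pairs. The threshold values are copied from the script in this change; the function name find_violations and the sample data are hypothetical and are not part of the patch.

```python
from collections import Counter

# Maximum age in hours per state, copied from the proposed script.
THRESHOLDS = {
    'building': .10,
    'delete': .10,
    'active': .10,
    'used': .4,
}


def find_violations(instances, thresholds=THRESHOLDS):
    """Return the (state, age) pairs whose age meets or exceeds the limit.

    States without a configured threshold are ignored.
    """
    violations = []
    for state, age in instances:
        limit = thresholds.get(state.lower())
        if limit is None:
            continue
        if age >= limit:
            violations.append((state.lower(), age))
    return violations


# Hypothetical sample data, not real nodepool output.
sample = [('building', 0.02), ('used', 0.55), ('delete', 0.31), ('ready', 3.0)]
bad = find_violations(sample)
summary = dict(Counter(state for state, _ in bad))
print("{} instances are violating max state age ({})".format(len(bad), summary))
```

Keeping the comparison in a small function of this kind would let the thresholds be exercised without a live nodepool.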
A modules/nodepool/files/check-nodepool-age.py
1 file changed, 67 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/73/335373/1
diff --git a/modules/nodepool/files/check-nodepool-age.py b/modules/nodepool/files/check-nodepool-age.py
new file mode 100644
index 0000000..74ab560
--- /dev/null
+++ b/modules/nodepool/files/check-nodepool-age.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python
+import argparse
+from collections import Counter
+import logging
+import sys
+import subprocess
+
+# described as 'Age (hours)' this resets upon state
+# change. i.e. building => ready => used
+thresholds = {
+    'building': .10,
+    'delete': .10,
+    'active': .10,
+    'used': .4,
+}
+
+def main():
+    argparser = argparse.ArgumentParser()
+
+    argparser.add_argument(
+        '--debug',
+        help='Turn on debug logging',
+        action='store_true'
+    )
+
+    args = argparser.parse_args()
+
+    logging.basicConfig(
+        format='%(asctime)s %(levelname)s %(message)s',
+        level=logging.DEBUG if args.debug else logging.INFO)
+
+    instances_raw = subprocess.check_output(['/usr/bin/nodepool', 'list'])
+
+    instances = {}
+    for line in instances_raw.splitlines():
+        if 'wmflabs-eqiad' in line:
+            props = [x.strip() for x in line.split('|') if x]
+            instances[props[5]] = {
+                'type': props[3],
+                'name': props[5],
+                'UUID': props[7],
+                'address': props[8],
+                'state': props[9].lower(),
+                'age': float(props[10]),
+            }
+
+    issues = []
+    for name, values in instances.iteritems():
+        logging.debug("{}: {}".format(name, str(values)))
+
+        state_max = thresholds.get(values['state'], 0)
+
+        if not state_max:
+            continue
+
+        if values['age'] >= state_max:
+            issues.append(values)
+
+    if len(issues) > 0:
+        logging.debug(str(issues))
+        bad_states = [x['state'] for x in issues]
+        details = str(dict(Counter(bad_states)))
+        print "{} instances are violating max state age ({})".format(len(issues), details)
+        sys.exit(1)
+
+if __name__ == '__main__':
+    main()
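The row parsing in the patch can be sketched on its own, here in Python 3 rather than the Python 2 of the script. The field positions below are taken from the indices the patch uses; the real column layout of 'nodepool list' is an assumption, and the helper name parse_row and the sample row (placeholder values only) are hypothetical. Unlike the patch, this sketch returns None for rows it cannot interpret instead of raising.

```python
def parse_row(line):
    """Split one '|'-delimited table row into a dict, or return None.

    Field positions follow the indices used in the patch; the actual
    'nodepool list' column layout is assumed, not verified.
    """
    props = [x.strip() for x in line.split('|') if x]
    if len(props) < 11:
        return None  # border line or short row
    try:
        age = float(props[10])
    except ValueError:
        return None  # e.g. a header row with a non-numeric age column
    return {
        'type': props[3],
        'name': props[5],
        'UUID': props[7],
        'address': props[8],
        'state': props[9].lower(),
        'age': age,
    }


# Hypothetical row with placeholder values, not captured nodepool output.
row = ("| 1 | wmflabs-eqiad | az | label | target | host-1 "
       "| node-1 | uuid-1 | 192.0.2.10 | Used | 0.55 |")
parsed = parse_row(row)
print(parsed)
```

Returning None for unparseable rows is one possible design; whether the check should instead fail loudly on unexpected output is a judgment call for the reviewers.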
--
To view, visit https://gerrit.wikimedia.org/r/335373
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifd28b6e15309efe13d69054c77ae190f741acb8a
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Rush <[email protected]>