Rush has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/335373 )

Change subject: wip nodepool: track and alert on age of instance states
......................................................................

wip nodepool: track and alert on age of instance states

'nodepool list' natively reports how long each instance has been
in its current state.  Nodepool keeps its own state table, which
is informed by nova.  In theory, recent issues would have been
caught by alerting, had we set a valid expected threshold for the
age of instances in each of these states:

* building (time to spin up)
* delete   (time to clean up an instance)
* active   (time from build to 'ready')
* used     (time an instance is out running a test)
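
The check described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the threshold values are the ones in the patch (0.10 hours is 6 minutes, 0.4 hours is 24 minutes), while `find_violations` and the sample instances are hypothetical names and data made up for the example. States without a threshold are skipped.

```python
from collections import Counter

# Maximum age in hours per state; values copied from the patch's thresholds.
THRESHOLDS = {'building': 0.10, 'delete': 0.10, 'active': 0.10, 'used': 0.4}


def find_violations(instances, thresholds=THRESHOLDS):
    """Return the instances whose age meets or exceeds their state's limit."""
    violations = []
    for inst in instances:
        limit = thresholds.get(inst['state'])
        if limit is None:
            # No threshold configured for this state, so nothing to check.
            continue
        if inst['age'] >= limit:
            violations.append(inst)
    return violations


# Hypothetical sample data, invented for illustration only.
SAMPLE = [
    {'name': 'node-a', 'state': 'building', 'age': 0.25},
    {'name': 'node-b', 'state': 'used', 'age': 0.10},
    {'name': 'node-c', 'state': 'hold', 'age': 5.00},
    {'name': 'node-d', 'state': 'delete', 'age': 0.10},
]

violations = find_violations(SAMPLE)
summary = dict(Counter(v['state'] for v in violations))
print("{} instances over max state age ({})".format(len(violations), summary))
```

Because the comparison is `>=`, an instance sitting exactly at the limit counts as a violation, matching the comparison used in the patch.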

Change-Id: Ifd28b6e15309efe13d69054c77ae190f741acb8a
Bugs: T156636
---
A modules/nodepool/files/check-nodepool-age.py
1 file changed, 67 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/73/335373/1

diff --git a/modules/nodepool/files/check-nodepool-age.py b/modules/nodepool/files/check-nodepool-age.py
new file mode 100644
index 0000000..74ab560
--- /dev/null
+++ b/modules/nodepool/files/check-nodepool-age.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python
+import argparse
+from collections import Counter
+import logging
+import sys
+import subprocess
+
+# Maximum age per state, in hours. 'nodepool list' reports this as
+# 'Age (hours)'; it resets on each state change, e.g. building => ready => used
+thresholds = {
+    'building': .10,
+    'delete': .10,
+    'active': .10,
+    'used': .4,
+}
+
+def main():
+    argparser = argparse.ArgumentParser()
+
+    argparser.add_argument(
+        '--debug',
+        help='Turn on debug logging',
+        action='store_true'
+    )
+
+    args = argparser.parse_args()
+
+    logging.basicConfig(
+        format='%(asctime)s %(levelname)s %(message)s',
+        level=logging.DEBUG if args.debug else logging.INFO)
+
+    instances_raw = subprocess.check_output(['/usr/bin/nodepool', 'list'])
+
+    instances = {}
+    for line in instances_raw.splitlines():
+        if 'wmflabs-eqiad' in line:
+            props = [x.strip() for x in line.split('|') if x]
+            instances[props[5]] = {
+                'type': props[3],
+                'name': props[5],
+                'UUID': props[7],
+                'address': props[8],
+                'state': props[9].lower(),
+                'age': float(props[10]),
+            }
+
+    issues = []
+    for name, values in instances.items():
+        logging.debug("{}: {}".format(name, str(values)))
+
+        state_max = thresholds.get(values['state'], 0)
+
+        if not state_max:
+            continue
+
+        if values['age'] >= state_max:
+            issues.append(values)
+
+    if len(issues) > 0:
+        logging.debug(str(issues))
+        bad_states = [x['state'] for x in issues]
+        details = str(dict(Counter(bad_states)))
+        print("{} instances are violating max state age ({})".format(len(issues), details))
+        sys.exit(1)
+
+if __name__ == '__main__':
+    main()
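
The patch reads columns from 'nodepool list' by fixed position (props[3], props[5], and so on), which depends on the exact column order. As a possible alternative, the sketch below keys each row by the header names instead. It assumes the output is a pipe-delimited table with '+---+' border lines and a single header row; `parse_table`, the sample table, and its column names are hypothetical and not taken from real nodepool output.

```python
def parse_table(text):
    """Parse a pipe-delimited ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip '+----+' border lines and anything that is not a table row.
        if not (line.startswith('|') and line.endswith('|')):
            continue
        # Drop the outer pipes, then split on the inner ones.
        cells = [cell.strip() for cell in line[1:-1].split('|')]
        if header is None:
            header = cells  # the first row encountered is treated as the header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical sample output, invented for illustration only.
SAMPLE = """\
+----+-----------+-------+-------------+
| ID | Hostname  | State | Age (hours) |
+----+-----------+-------+-------------+
| 1  | ci-node-1 | used  | 0.52        |
| 2  | ci-node-2 | ready | 0.03        |
+----+-----------+-------+-------------+
"""

rows = parse_table(SAMPLE)
for row in rows:
    print(row['Hostname'], row['State'].lower(), float(row['Age (hours)']))
```

Looking fields up by name means a column added or reordered in the table produces a KeyError at the lookup rather than silently reading the wrong field.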

-- 
To view, visit https://gerrit.wikimedia.org/r/335373
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifd28b6e15309efe13d69054c77ae190f741acb8a
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Rush <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
