On Thu, 27 Feb 2020 at 06:53, Rick Elrod <[email protected]> wrote: > I'd like to apply the following which does: > - Adds a script I wrote for reading a timestamp from a file on disk > and alerting if the timestamp within it is NOT within a particular > delta to now. > - Applies this to sundries01 and uses it to check > /srv/websites/getfedora.org/build.timestamp.txt which now gets > generated as part of the websites build. > > The purpose is because sometimes someone will commit something to the > websites repo which breaks the build, but because of how we have > things set up in openshift (cronjob), we don't get any kind of alert > when that happens. >
I think it would be better to find a way to monitor the cronjob in OpenShift since that will be useful for other projects. Did you investigate that idea ? > > Right now this sets the delta to 3 hours. In theory it should be 1, > but I figure let it try to build a few times before we start alerting. > +1 but I would prefer a way to have notification on a failed cronjob :-) > > Rick > > > commit 657d050f6d699bc43973d968cd93d12131fca7f2 > Author: Rick Elrod <[email protected]> > Date: Thu Feb 27 05:29:24 2020 +0000 > > nagios: Add script and check for checking that a timestamp within > a file is within a delta of now, and then use this for alerting when > websites stop building > > Signed-off-by: Rick Elrod <[email protected]> > > diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file > b/roles/nagios_client/files/scripts/check_timestamp_from_file > new file mode 100644 > index 0000000..9064337 > --- /dev/null > +++ b/roles/nagios_client/files/scripts/check_timestamp_from_file > @@ -0,0 +1,43 @@ > +#!/usr/bin/env python > + > +# Takes a path to a file and a delta. The file must simply contain an > epoch > +# timestamp. It can be an integer or a float, as can the delta. > +# > +# Alerts critical if (now - timestamp contained in file) > delta. > +# > +# Rick Elrod <[email protected]> > +# MIT > + > +import sys > +import time > + > +if len(sys.argv) != 3: > + print('UNKNOWN: Pass path to file and delta as parameters') > + sys.exit(3) > + > +filename = sys.argv[1] > +delta = float(sys.argv[2]) > + > +timestamp = None > + > +try: > + with open(filename, 'r') as f: > + timestamp = float(f.read().strip()) > +except Exception as e: > + print('UNKNOWN: Unable to open/read file path') > + sys.exit(3) > + > +difference = round(time.time() - timestamp, 2) > +if difference > delta: > + print( > + 'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by > %.2f seconds' % ( > + timestamp, > + delta, > + difference - delta)) > + sys.exit(2) > + > +print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by > %.2f seconds' % ( > + timestamp, > + delta, > + abs(difference - delta))) > +sys.exit(0) > diff --git a/roles/nagios_client/tasks/main.yml > b/roles/nagios_client/tasks/main.yml > index 2e5e0df..8e71a3b 100644 > --- a/roles/nagios_client/tasks/main.yml > +++ b/roles/nagios_client/tasks/main.yml > @@ -47,6 +47,7 @@ > - check_osbs_api.py > - check_ipa_replication > - check_redis_queue.sh > + - check_timestamp_from_file > when: not inventory_hostname.startswith('noc') > tags: > - nagios_client > @@ -226,6 +227,16 @@ > tags: > - nagios_client > > +- name: install nrpe checks for sundries/websites > + template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root > group=root mode=0644 > + with_items: > + - check_websites_buildtime.cfg > + when: inventory_hostname.startswith('sundries') > + notify: > + - restart nrpe > + tags: > + - nagios_client > + > - name: install nrpe config for the RabbitMQ checks > template: > src: "rabbitmq_args.ini.j2" > diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 > b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 > new file mode 100644 > index 0000000..ff5639d > --- /dev/null > +++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 > @@ -0,0 +1,2 @@ > +# Alert if websites haven't been built in 3 hours > +command[check_websites_buildtime]={{ libdir > }}/nagios/plugins/check_timestamp_from_file > /srv/websites/getfedora.org/build.timestamp.txt 10800 > diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 > b/roles/nagios_server/templates/nagios/services/websites.cfg.j2 > index 85e8f8e..c8958d7 100644 > --- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 > +++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2 > @@ -316,4 +316,14 @@ define service { > use ppc-secondarytemplate > } > > +## Auxillary to websites but necessary to make them happen > + > +define service { > + host_name sundries01.phx2.fedoraproject.org > + service_description websites build happened recently > + check_command check_by_nrpe!check_websites_buildtime > + use websitetemplate > +} > + > + > {% endif %} > _______________________________________________ > infrastructure mailing list -- [email protected] > To unsubscribe send an email to > [email protected] > Fedora Code of Conduct: > https://docs.fedoraproject.org/en-US/project/code-of-conduct/ > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines > List Archives: > https://lists.fedoraproject.org/archives/list/[email protected] >
_______________________________________________ infrastructure mailing list -- [email protected] To unsubscribe send an email to [email protected] Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/[email protected]
