On Thu, 27 Feb 2020 at 12:03, Rick Elrod <codebl...@elrod.me> wrote:

> On Thu, Feb 27, 2020 at 4:31 AM Clement Verna <cve...@fedoraproject.org>
> wrote:
> >
> > On Thu, 27 Feb 2020 at 06:53, Rick Elrod <codebl...@elrod.me> wrote:
> >>
> >> I'd like to apply the following which does:
> >> - Adds a script I wrote for reading a timestamp from a file on disk
> >>   and alerting if the timestamp within it is NOT within a particular
> >>   delta to now.
> >> - Applies this to sundries01 and uses it to check
> >>   /srv/websites/getfedora.org/build.timestamp.txt which now gets
> >>   generated as part of the websites build.
> >>
> >> The purpose is because sometimes someone will commit something to the
> >> websites repo which breaks the build, but because of how we have
> >> things set up in openshift (cronjob), we don't get any kind of alert
> >> when that happens.
> >
> > I think it would be better to find a way to monitor the cronjob in
> > OpenShift since that will be useful for other projects.
> > Did you investigate that idea ?
> >
> >>
> >> Right now this sets the delta to 3 hours. In theory it should be 1,
> >> but I figure let it try to build a few times before we start alerting.
> >
> > +1 but I would prefer a way to have notification on a failed cronjob :-)
>
> I'd prefer that too (or probably in addition), but I don't know
> anything about how to set up that monitoring right now.
> It looks like there's an OpenShift API endpoint for monitoring crons:
> https://major.io/2019/11/18/monitoring-openshift-cron-jobs/
> but we'd need to set up an API key for nagios checks to use somehow.
>
Yes I think we would need to have a "nagios" service account, then that
should give us a token to use for authentication.

> Probably worth looking into, but for the time being I'd still like to
> apply this FBR, as we are going to have some Outreachy activity
> happening on websites soon and we need to know that the prod build
> isn't broken.
>
> -re
>
> >>
> >> Rick
> >>
> >>
> >> commit 657d050f6d699bc43973d968cd93d12131fca7f2
> >> Author: Rick Elrod <rel...@redhat.com>
> >> Date: Thu Feb 27 05:29:24 2020 +0000
> >>
> >>     nagios: Add script and check for checking that a timestamp within
> >>     a file is within a delta of now, and then use this for alerting when
> >>     websites stop building
> >>
> >>     Signed-off-by: Rick Elrod <rel...@redhat.com>
> >>
> >> diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file
> >> new file mode 100644
> >> index 0000000..9064337
> >> --- /dev/null
> >> +++ b/roles/nagios_client/files/scripts/check_timestamp_from_file
> >> @@ -0,0 +1,43 @@
> >> +#!/usr/bin/env python
> >> +
> >> +# Takes a path to a file and a delta. The file must simply contain an epoch
> >> +# timestamp. It can be an integer or a float, as can the delta.
> >> +#
> >> +# Alerts critical if (now - timestamp contained in file) > delta.
> >> +#
> >> +# Rick Elrod <rel...@redhat.com>
> >> +# MIT
> >> +
> >> +import sys
> >> +import time
> >> +
> >> +if len(sys.argv) != 3:
> >> +    print('UNKNOWN: Pass path to file and delta as parameters')
> >> +    sys.exit(3)
> >> +
> >> +filename = sys.argv[1]
> >> +delta = float(sys.argv[2])
> >> +
> >> +timestamp = None
> >> +
> >> +try:
> >> +    with open(filename, 'r') as f:
> >> +        timestamp = float(f.read().strip())
> >> +except Exception as e:
> >> +    print('UNKNOWN: Unable to open/read file path')
> >> +    sys.exit(3)
> >> +
> >> +difference = round(time.time() - timestamp, 2)
> >> +if difference > delta:
> >> +    print(
> >> +        'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % (
> >> +            timestamp,
> >> +            delta,
> >> +            difference - delta))
> >> +    sys.exit(2)
> >> +
> >> +print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % (
> >> +    timestamp,
> >> +    delta,
> >> +    abs(difference - delta)))
> >> +sys.exit(0)
> >> diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml
> >> index 2e5e0df..8e71a3b 100644
> >> --- a/roles/nagios_client/tasks/main.yml
> >> +++ b/roles/nagios_client/tasks/main.yml
> >> @@ -47,6 +47,7 @@
> >>    - check_osbs_api.py
> >>    - check_ipa_replication
> >>    - check_redis_queue.sh
> >> +  - check_timestamp_from_file
> >>    when: not inventory_hostname.startswith('noc')
> >>    tags:
> >>    - nagios_client
> >> @@ -226,6 +227,16 @@
> >>    tags:
> >>    - nagios_client
> >>
> >> +- name: install nrpe checks for sundries/websites
> >> +  template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644
> >> +  with_items:
> >> +  - check_websites_buildtime.cfg
> >> +  when: inventory_hostname.startswith('sundries')
> >> +  notify:
> >> +  - restart nrpe
> >> +  tags:
> >> +  - nagios_client
> >> +
> >>  - name: install nrpe config for the RabbitMQ checks
> >>    template:
> >>      src: "rabbitmq_args.ini.j2"
> >> diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
> >> new file mode 100644
> >> index 0000000..ff5639d
> >> --- /dev/null
> >> +++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
> >> @@ -0,0 +1,2 @@
> >> +# Alert if websites haven't been built in 3 hours
> >> +command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800
> >> diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> index 85e8f8e..c8958d7 100644
> >> --- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> +++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> @@ -316,4 +316,14 @@ define service {
> >>    use ppc-secondarytemplate
> >>  }
> >>
> >> +## Auxillary to websites but necessary to make them happen
> >> +
> >> +define service {
> >> +  host_name sundries01.phx2.fedoraproject.org
> >> +  service_description websites build happened recently
> >> +  check_command check_by_nrpe!check_websites_buildtime
> >> +  use websitetemplate
> >> +}
> >> +
> >> +
> >>  {% endif %}
> >> _______________________________________________
> >> infrastructure mailing list -- infrastructure@lists.fedoraproject.org
> >> To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org
> >> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> >> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> >> List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org
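
For what it's worth, the patch above only shows the consumer side of build.timestamp.txt; how the websites build writes the file is not shown in this thread. A minimal sketch of both halves, assuming Python 3 and a plain epoch integer in the file, could look like the following. The names write_build_timestamp and check_timestamp are hypothetical and only for illustration; the temp-file-plus-rename step is my assumption about how to avoid the check reading a half-written file, not something the actual build is known to do.

```python
import os
import tempfile
import time

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def write_build_timestamp(path, now=None):
    """Atomically write an epoch timestamp to `path` (hypothetical helper)."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it over the
    # target, so a concurrent check never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.build.timestamp.')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('%d\n' % now)
        # mkstemp creates the file as 0600; make it world-readable so a
        # separate monitoring user could read it (assumption).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def check_timestamp(path, delta, now=None):
    """Return (exit_code, message) with the same three outcomes as the script."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            timestamp = float(f.read().strip())
    except (OSError, ValueError):
        return UNKNOWN, 'UNKNOWN: Unable to open/read file path'
    age = now - timestamp
    if age > delta:
        return CRITICAL, 'CRITICAL: Timestamp exceeds delta (%.2f) by %.2f seconds' % (
            delta, age - delta)
    return OK, 'OK: Timestamp is within delta (%.2f) of now' % delta
```

Returning a tuple instead of calling sys.exit() directly keeps the logic testable; a thin wrapper would print the message and exit with the code.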
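
On the idea of monitoring the cronjob itself, discussed earlier in this thread: a rough sketch of what such a check might evaluate, assuming the nagios service account token can be used to list Job objects in the project (for example from /apis/batch/v1/namespaces/<project>/jobs) and that the response has already been parsed from JSON. The function name, the name_prefix parameter, and the "websites-build" job name in the usage note are hypothetical. The status fields used (succeeded, completionTime) are standard Kubernetes Job status fields, but nothing here has been run against our cluster.

```python
from datetime import datetime, timezone

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def parse_k8s_time(value):
    """Parse the 'YYYY-MM-DDTHH:MM:SSZ' timestamps used in Job status."""
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc)


def check_last_successful_job(job_list, max_age_seconds, name_prefix='', now=None):
    """Return (exit_code, message) based on the newest successful Job.

    `job_list` is a parsed JobList dict. Jobs created by a CronJob are named
    after it, so `name_prefix` can narrow the check to one cronjob.
    """
    now = datetime.now(timezone.utc) if now is None else now
    completion_times = []
    for job in job_list.get('items', []):
        name = job.get('metadata', {}).get('name', '')
        status = job.get('status', {})
        if not name.startswith(name_prefix):
            continue
        # Only jobs that actually succeeded carry a completionTime we trust.
        if status.get('succeeded', 0) >= 1 and 'completionTime' in status:
            completion_times.append(parse_k8s_time(status['completionTime']))
    if not completion_times:
        return CRITICAL, 'CRITICAL: No successful job found'
    age = (now - max(completion_times)).total_seconds()
    if age > max_age_seconds:
        return CRITICAL, 'CRITICAL: Last successful job finished %d seconds ago' % age
    return OK, 'OK: Last successful job finished %d seconds ago' % age
```

Called as check_last_successful_job(jobs, 10800, name_prefix='websites-build'), this would apply the same 3 hour threshold as the file-based check, and could in principle be reused for other projects' cronjobs.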