On Thu, 27 Feb 2020 at 12:03, Rick Elrod <codebl...@elrod.me> wrote:

> On Thu, Feb 27, 2020 at 4:31 AM Clement Verna <cve...@fedoraproject.org>
> wrote:
> >
> >
> >
> > On Thu, 27 Feb 2020 at 06:53, Rick Elrod <codebl...@elrod.me> wrote:
> >>
> >> I'd like to apply the following which does:
> >> - Adds a script I wrote for reading a timestamp from a file on disk
> >> and alerting if the timestamp within it is NOT within a particular
> >> delta to now.
> >> - Applies this to sundries01 and uses it to check
> >> /srv/websites/getfedora.org/build.timestamp.txt which now gets
> >> generated as part of the websites build.
> >>
> >> The purpose: sometimes someone commits something to the websites
> >> repo that breaks the build, but because of how we have things set
> >> up in OpenShift (a cronjob), we don't get any kind of alert when
> >> that happens.
> >
> >
> > I think it would be better to find a way to monitor the cronjob in
> > OpenShift, since that will be useful for other projects.
> > Did you investigate that idea?
> >
> >>
> >>
> >> Right now this sets the delta to 3 hours. In theory it should be 1,
> >> but I figure let it try to build a few times before we start alerting.
> >
> >
> > +1 but I would prefer a way to have notification on a failed cronjob :-)
>
> I'd prefer that too (or probably in addition), but I don't know
> anything about how to set up that monitoring right now.
> It looks like there's an OpenShift API endpoint for monitoring crons:
> https://major.io/2019/11/18/monitoring-openshift-cron-jobs/
> but we'd need to set up an API key for nagios checks to use somehow.
>

Yes, I think we would need a "nagios" service account; that should give
us a token to use for authentication.
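
For reference, a minimal sketch of what such a check might look like, assuming the token belongs to a service account that is allowed to list Jobs in the project. It reads the Job list from the standard Kubernetes batch API and reports on the most recent Job spawned by the cronjob. The function names, the `websites-build` prefix used in the usage note, and the exit-code policy are illustrative assumptions, not something we have deployed.

```python
import json
import ssl
import urllib.request


def fetch_jobs(api_url, namespace, token, ca_file=None):
    """Fetch the Job list for a namespace using a service account bearer token."""
    url = '%s/apis/batch/v1/namespaces/%s/jobs' % (api_url.rstrip('/'), namespace)
    request = urllib.request.Request(
        url, headers={'Authorization': 'Bearer %s' % token})
    # ca_file is the cluster CA bundle; None falls back to the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.loads(response.read().decode('utf-8'))


def evaluate_jobs(job_list, name_prefix):
    """Return a Nagios (exit_code, message) pair for the newest matching Job.

    Jobs created by a CronJob are named '<cronjob-name>-<suffix>', so a
    prefix match (hypothetical policy) selects the Jobs of one CronJob.
    """
    jobs = [
        job for job in job_list.get('items', [])
        if job.get('metadata', {}).get('name', '').startswith(name_prefix)
        and job.get('status', {}).get('startTime')
    ]
    if not jobs:
        return 3, 'UNKNOWN: no started jobs found with prefix %s' % name_prefix

    # startTime is an RFC 3339 UTC string, so string order matches time order.
    latest = max(jobs, key=lambda job: job['status']['startTime'])
    name = latest['metadata']['name']

    for condition in latest['status'].get('conditions', []):
        if condition.get('type') == 'Failed' and condition.get('status') == 'True':
            return 2, 'CRITICAL: job %s failed' % name
    if latest['status'].get('succeeded', 0) > 0:
        return 0, 'OK: job %s succeeded' % name
    return 1, 'WARNING: job %s has not finished yet' % name
```

A wrapper script would call `fetch_jobs(...)`, pass the result to `evaluate_jobs(jobs, 'websites-build')`, print the message, and exit with the returned code. Keeping the evaluation separate from the HTTP call means the policy can be tested without a cluster.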


> Probably worth looking into, but for the time being I'd still like to
> apply this FBR, as we are going to have some Outreachy activity
> happening on websites soon and we need to know that the prod build
> isn't broken.
>

> -re
>
> >
> >>
> >>
> >> Rick
> >>
> >>
> >> commit 657d050f6d699bc43973d968cd93d12131fca7f2
> >> Author: Rick Elrod <rel...@redhat.com>
> >> Date:   Thu Feb 27 05:29:24 2020 +0000
> >>
> >>     nagios: Add script and check for checking that a timestamp within a file is within a delta of now, and then use this for alerting when websites stop building
> >>
> >>     Signed-off-by: Rick Elrod <rel...@redhat.com>
> >>
> >> diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file
> >> new file mode 100644
> >> index 0000000..9064337
> >> --- /dev/null
> >> +++ b/roles/nagios_client/files/scripts/check_timestamp_from_file
> >> @@ -0,0 +1,43 @@
> >> +#!/usr/bin/env python
> >> +
> >> +# Takes a path to a file and a delta. The file must simply contain an epoch
> >> +# timestamp. It can be an integer or a float, as can the delta.
> >> +#
> >> +# Alerts critical if (now - timestamp contained in file) > delta.
> >> +#
> >> +# Rick Elrod <rel...@redhat.com>
> >> +# MIT
> >> +
> >> +import sys
> >> +import time
> >> +
> >> +if len(sys.argv) != 3:
> >> +    print('UNKNOWN: Pass path to file and delta as parameters')
> >> +    sys.exit(3)
> >> +
> >> +filename = sys.argv[1]
> >> +delta = float(sys.argv[2])
> >> +
> >> +timestamp = None
> >> +
> >> +try:
> >> +    with open(filename, 'r') as f:
> >> +        timestamp = float(f.read().strip())
> >> +except Exception as e:
> >> +    print('UNKNOWN: Unable to open/read file path')
> >> +    sys.exit(3)
> >> +
> >> +difference = round(time.time() - timestamp, 2)
> >> +if difference > delta:
> >> +    print(
> >> +        'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % (
> >> +            timestamp,
> >> +            delta,
> >> +            difference - delta))
> >> +    sys.exit(2)
> >> +
> >> +print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % (
> >> +    timestamp,
> >> +    delta,
> >> +    abs(difference - delta)))
> >> +sys.exit(0)
> >> diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml
> >> index 2e5e0df..8e71a3b 100644
> >> --- a/roles/nagios_client/tasks/main.yml
> >> +++ b/roles/nagios_client/tasks/main.yml
> >> @@ -47,6 +47,7 @@
> >>    - check_osbs_api.py
> >>    - check_ipa_replication
> >>    - check_redis_queue.sh
> >> +  - check_timestamp_from_file
> >>    when: not inventory_hostname.startswith('noc')
> >>    tags:
> >>    - nagios_client
> >> @@ -226,6 +227,16 @@
> >>    tags:
> >>    - nagios_client
> >>
> >> +- name: install nrpe checks for sundries/websites
> >> +  template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644
> >> +  with_items:
> >> +  - check_websites_buildtime.cfg
> >> +  when: inventory_hostname.startswith('sundries')
> >> +  notify:
> >> +  - restart nrpe
> >> +  tags:
> >> +  - nagios_client
> >> +
> >>  - name: install nrpe config for the RabbitMQ checks
> >>    template:
> >>      src: "rabbitmq_args.ini.j2"
> >> diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
> >> new file mode 100644
> >> index 0000000..ff5639d
> >> --- /dev/null
> >> +++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
> >> @@ -0,0 +1,2 @@
> >> +# Alert if websites haven't been built in 3 hours
> >> +command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800
> >> diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> index 85e8f8e..c8958d7 100644
> >> --- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> +++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
> >> @@ -316,4 +316,14 @@ define service {
> >>    use                   ppc-secondarytemplate
> >>  }
> >>
> >> +## Auxiliary to websites but necessary to make them happen
> >> +
> >> +define service {
> >> +  host_name             sundries01.phx2.fedoraproject.org
> >> +  service_description   websites build happened recently
> >> +  check_command         check_by_nrpe!check_websites_buildtime
> >> +  use                   websitetemplate
> >> +}
> >> +
> >> +
> >>  {% endif %}
> >> _______________________________________________
> >> infrastructure mailing list -- infrastructure@lists.fedoraproject.org
> >> To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org
> >> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> >> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> >> List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org