[Canonical-ci-engineering] Long-running nagios checks

Evan Dandrea Mon, 07 Jul 2014 06:29:36 -0700

As mentioned in the team meeting, NRPE checks that take a long time to
complete cause complications. I spoke to James about this a moment
ago.


There's a 30-second response timeout in NRPE (Nagios Remote Plugin
Executor). The way James's team and other prodstack-deployed teams
work around this is by driving the test from cron. On success, the
cron job writes a success message into a file on disk. On failure, it
writes a failure message into the same file. NRPE then checks both
that the timestamp of this file is recent and that it contains the
success message. This covers both the cron job itself failing (the
file doesn't exist or hasn't been updated in a while) and the test
itself failing.
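
A minimal sketch of the NRPE side of that pattern, in Python. This is
not the code from lp:canonical-is-puppet; the function name, the "OK"
success marker and the max_age_seconds parameter are assumptions made
for illustration. The exit codes are the standard Nagios plugin ones.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_status(path, max_age_seconds, success_marker="OK", now=None):
    """Return (exit_code, message) for a status file written by cron.

    success_marker is a hypothetical convention: the cron job is assumed
    to start the file with this string when the test passed.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as status_file:
            content = status_file.read().strip()
    except OSError as exc:
        # Missing or unreadable file: the cron job has likely never run.
        return CRITICAL, "CRITICAL: cannot read %s (%s)" % (path, exc.strerror)
    if age > max_age_seconds:
        # Stale file: the cron job has stopped updating it.
        return CRITICAL, "CRITICAL: %s is %d seconds old" % (path, age)
    if not content.startswith(success_marker):
        # Fresh file, but the test itself reported a failure.
        return CRITICAL, "CRITICAL: %s" % (content or "empty status file")
    return OK, content
```

An NRPE command would print the returned message and exit with the
returned code. Because it only stats and reads one small file, it
finishes well inside the timeout however long the real test takes.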

James said the code for this is buried in the depths of
lp:canonical-is-puppet.

As one example, cron¹ calls the u1db engine status check², which calls
Nagios's check_http on the local wsgi server and dumps the results to
disk³. This is then read by the NRPE-called check⁴.

¹ ./modules/ubuntuone/templates/u1db-engines-check-cron.erb
² ./modules/ubuntuone/templates/get_u1db_engines_status.sh.erb
³ /srv/<%= vhost_name %>/var/nagios/engines_status
⁴ ./modules/ubuntuone/files/check_u1db_engines_status.py
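
For illustration, the cron side of an arrangement like the example
above could look roughly like the following Python sketch. It is not
the script from the puppet tree (that one is a shell template). The
run_and_record name, the "OK"/"FAIL" prefixes and the timeout_seconds
parameter are all hypothetical.

```python
import os
import subprocess
import tempfile


def run_and_record(command, status_path, timeout_seconds=None):
    """Run a slow check from cron and record its outcome in status_path.

    Returns True if the command exited with status 0.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
        succeeded = result.returncode == 0
        detail = result.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not be started, or it ran for too long.
        succeeded, detail = False, str(exc)
    prefix = "OK" if succeeded else "FAIL"  # hypothetical markers
    line = "%s: %s\n" % (prefix, detail or "no output")
    # Write to a temporary file in the same directory and rename it into
    # place, so a concurrent reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(status_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write(line)
    # mkstemp creates the file owner-only; widen it so another user
    # (such as the one running NRPE) can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, status_path)
    return succeeded
```

The file is rewritten on every run, including failures, so its
modification time only goes stale when cron itself stops running the
job.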
