As mentioned in the team meeting, NRPE checks that take a long time to complete cause problems. I spoke to James about this a moment ago.
NRPE (Nagios Remote Plugin Executor) has a 30-second response timeout. The workaround they and other prodstack-deployed teams use is to drive the test from cron. The cron job writes a success message to a file on disk when the test passes, and a failure message to the same file when it fails. NRPE then checks both that the file's timestamp is recent and that the file contains the success message. This covers the cron job itself failing (the file doesn't exist or hasn't been updated in a while) as well as the test itself failing.

James said the code for this is buried in the depths of lp:canonical-is-puppet. As one example, cron¹ calls the u1db engine status check², which calls Nagios' check_http against the local WSGI server and dumps the results to disk³. The NRPE-called check⁴ then reads that file.

¹ ./modules/ubuntuone/templates/u1db-engines-check-cron.erb
² /srv/<%= vhost_name %>/var/nagios/engines_status
³ ./modules/ubuntuone/templates/get_u1db_engines_status.sh.erb
⁴ ./modules/ubuntuone/files/check_u1db_engines_status.py
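
To make the pattern concrete, here is a minimal Python sketch of both halves: the cron side that records the result, and the NRPE side that checks the file's freshness and contents. This is not the code from lp:canonical-is-puppet. The file path, the "OK"/"FAIL" markers, the 600-second age limit, and the function names are all assumptions made for illustration. Only the exit codes (0 for OK, 2 for CRITICAL) follow the standard Nagios plugin convention.

```python
import os
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Hypothetical values for illustration only.
STATUS_FILE = "/tmp/example_check_status"
SUCCESS_MARKER = "OK"
MAX_AGE_SECONDS = 600


def write_status(path, succeeded, detail=""):
    """Cron side: record the test outcome in the status file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write((SUCCESS_MARKER if succeeded else "FAIL") + " " + detail + "\n")
    # Rename into place so the NRPE side never reads a half-written file.
    os.replace(tmp, path)


def check_status(path, max_age=MAX_AGE_SECONDS, now=None):
    """NRPE side: return (exit_code, message) based on the status file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as f:
            content = f.read().strip()
    except OSError as exc:
        # Missing or unreadable file: the cron job has probably never run.
        return CRITICAL, "CRITICAL: cannot read status file: %s" % exc
    if age > max_age:
        # Stale file: the cron job has stopped updating it.
        return CRITICAL, "CRITICAL: status file is %d seconds old" % age
    if not content.startswith(SUCCESS_MARKER):
        # Fresh file, but the test itself reported a failure.
        return CRITICAL, "CRITICAL: " + content
    return OK, content
```

The cron job would call write_status() after running the slow test, and the NRPE command would print the message from check_status() and exit with the returned code. Because the NRPE side only stats and reads a small file, it returns well inside the timeout no matter how long the test takes.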

