Hello all, given a fairly well-running monitoring setup with about 18k services I thought I had understood the basics. However, the following leaves me clueless, and I hope I'm merely missing something obvious here:
On an up-to-date Debian Squeeze (i386) OpenVZ guest I have established that my monitoring user can execute a given command:

root@vserv08:/# sudo -u monitor -i /usr/lib/nagios/plugins/check_dummy 0 success; echo Exitcode: $?
OK: success
Exitcode: 0

So far, so good. Now entering NRPE, using a stripped-down config for illustrating the point:

root@vserv08:/# grep -v -e '^$' -e '^#' /etc/nagios/nrpe.cfg
debug=1
nrpe_user=monitor
nrpe_group=monitor
allowed_hosts=127.0.0.1
command[dummy]=/usr/lib/nagios/plugins/check_dummy 0 success

root@vserv08:/# ps auxww | grep '[/]usr/sbin/nrpe'
monitor 7215 0.0 0.1 3704 892 ? Ss 15:20 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d

The process startup logged as follows:

Oct 8 15:20:22 vserv08 nrpe[7214]: Added command[dummy]=/usr/lib/nagios/plugins/check_dummy 0 success
Oct 8 15:20:22 vserv08 nrpe[7214]: INFO: SSL/TLS initialized. All network traffic will be encrypted.
Oct 8 15:20:22 vserv08 nrpe[7215]: Starting up daemon
Oct 8 15:20:22 vserv08 nrpe[7215]: Listening for connections on port 5666
Oct 8 15:20:22 vserv08 nrpe[7215]: Allowing connections from: 127.0.0.1

However, executing the dummy command won't work:

root@vserv08:/# /usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c dummy
NRPE: Unable to read output

This has been logged as:

Oct 8 15:21:36 vserv08 nrpe[7234]: Connection from 127.0.0.1 port 48791
Oct 8 15:21:36 vserv08 nrpe[7234]: Host address is in allowed_hosts
Oct 8 15:21:36 vserv08 nrpe[7234]: Handling the connection...
Oct 8 15:21:36 vserv08 nrpe[7234]: Host is asking for command 'dummy' to be run...
Oct 8 15:21:36 vserv08 nrpe[7234]: Running command: /usr/lib/nagios/plugins/check_dummy 0 success
Oct 8 15:21:36 vserv08 nrpe[7234]: Command completed with return code 2 and output:
Oct 8 15:21:36 vserv08 nrpe[7234]: Return Code: 2, Output: NRPE: Unable to read output
Oct 8 15:21:36 vserv08 nrpe[7234]: Connection from 127.0.0.1 closed.
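One way to check whether the plugin produces any output at all when NRPE launches it is to point the command definition at a small logging wrapper. The following is only a minimal sketch, not the exact script used on this system: the function name run_logged, the wrapper's file name and the LOGFILE location are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: runs the real plugin, records what it printed and how
# it exited, then passes both through unchanged so NRPE sees the same result.
# LOGFILE is an arbitrary choice for this sketch, not an NRPE setting.
LOGFILE="${LOGFILE:-/tmp/nrpe-wrapper.log}"

run_logged() {
    # Capture stdout and stderr together; $? right after is the plugin's status.
    output=$("$@" 2>&1)
    status=$?
    # Append one line per invocation so repeated checks can be compared.
    printf '%s status=%s output=[%s]\n' \
        "$(date '+%b %e %T')" "$status" "$output" >> "$LOGFILE"
    # Hand the plugin's text and exit code back to the caller (NRPE).
    printf '%s\n' "$output"
    return "$status"
}

# When called with arguments, treat them as the plugin command line, e.g.:
#   wrapper.sh /usr/lib/nagios/plugins/check_dummy 0 success
if [ "$#" -gt 0 ]; then
    run_logged "$@"
    exit $?
fi
```

Assuming the script is saved under some path readable and executable by the monitor user, the command[dummy] line in nrpe.cfg would be changed to call that path followed by the original check_dummy command line. If the log file then shows the expected text while check_nrpe still reports "Unable to read output", the plugin itself is working and the output is lost somewhere between the plugin and NRPE.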
This strikes me as weird: nrpe tries to execute the defined command, but somehow no output shows up. I know of the peculiarities that might arise once sudo joins the team or when permissions aren't set appropriately, but this doesn't apply here. Playing around with the dummy command (substituting a shell script, sprinkling '| tee -a logfile' into the code, ...) revealed that the desired text output is indeed generated but somehow gets discarded.

Perhaps the monitoring user or even the whole system is subtly broken, but given that there are ~400 similarly set-up systems (all using the same workflow/automation for deploying the monitoring infrastructure) I was starting to wonder how that might have happened ...

However, it got weirder: if I strace the nrpe process, everything works as desired:

root@vserv08:/# strace -f -o /root/log -p 7215

And then in another terminal:

root@vserv08:/# /usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c dummy
OK: success

Logged as follows:

Oct 8 15:21:57 vserv08 nrpe[7240]: Connection from 127.0.0.1 port 37275
Oct 8 15:21:57 vserv08 nrpe[7240]: Host address is in allowed_hosts
Oct 8 15:21:57 vserv08 nrpe[7240]: Handling the connection...
Oct 8 15:21:57 vserv08 nrpe[7240]: Host is asking for command 'dummy' to be run...
Oct 8 15:21:57 vserv08 nrpe[7240]: Running command: /usr/lib/nagios/plugins/check_dummy 0 success
Oct 8 15:21:57 vserv08 nrpe[7240]: Command completed with return code 0 and output: OK: success
Oct 8 15:21:57 vserv08 nrpe[7240]: Return Code: 0, Output: OK: success
Oct 8 15:21:57 vserv08 nrpe[7240]: Connection from 127.0.0.1 closed.

I found no further hints in the strace log, but this led me to assume that there is some NRPE weirdness involved, and thus I'm writing here instead of digging further through the system.

Any ideas?

Cheers,
Flo