I've been running Nagios for years, and today I ran into an issue that has me banging my head against a wall.
I've got a distributed setup: basically two Nagios 3.1.0 machines on Red Hat EL4 running the same checks simultaneously. Today they both started reporting a return code of 126 or 127 for various commands that are not missing and whose permissions allow Nagios to run them. For example, this happens whenever a notification is attempted:

[1255684343] Warning: Attempting to execute the command "/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nNotification Number: 2\n\nService: MYSERVICE\nHost: myhost\nAddress: myhost.edited.com\nState: CRITICAL\n\nDate/Time: Fri Oct 16 02:12:22 PDT 2009\n\nAdditional Info:\n\n(Return code of 127 is out of bounds - plugin may be missing)\n\nComment: : \n\nWiki: https://wiki.link\n\nNagios: https://nagios/nagios/cgi-bin/extinfo.cgi?type=2&host=myhost&service=MYSERVICE" | /bin/mail -s "PROBLEM: myhost/MYSERVICE CRITICAL **" [email protected]" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...

If I use "su - nagios" and copy and paste the failed command at a command prompt, it works. The notification commands very consistently return a 127, while various checks (but not all of them) will return a 126 or a 127. Stranger still, the exact same plugin (check_http, for example) may work fine for one service but return an error code for another.

Now, my installation on this instance of Nagios is pretty large: 548 hosts and about 8500 services. The same check configurations and plugins, however, are synced across 24 other Nagios boxes and assigned to different hosts, and those all work just fine. It's just this one, my biggest installation, where they've started failing.

This feels to me like I've hit some sort of capacity limitation. I've pared down some things (like cutting a complicated escalation configuration from 24,000 escalations to 3,500), but that didn't help.
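For reference, here is a minimal sketch of what those two codes conventionally mean when a POSIX shell launches a command: 127 when the command cannot be found, 126 when it is found but cannot be executed. The scratch directory and the file names no_such_plugin and not_executable are hypothetical and exist only for this illustration; nothing here touches the Nagios install or reproduces the failure above.

```shell
#!/bin/sh
# Hypothetical scratch directory, created only for this demonstration.
tmpdir=$(mktemp -d)

# Case 1: the path does not exist, so the shell reports 127.
rc_missing=0
"$tmpdir/no_such_plugin" 2>/dev/null || rc_missing=$?

# Case 2: the file exists but has no execute bit, so the shell reports 126.
printf '#!/bin/sh\nexit 0\n' > "$tmpdir/not_executable"
chmod 644 "$tmpdir/not_executable"
rc_noexec=0
"$tmpdir/not_executable" 2>/dev/null || rc_noexec=$?

echo "missing command: $rc_missing"
echo "not executable:  $rc_noexec"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Since the same command works when pasted at a prompt as the nagios user, neither textbook cause obviously applies here, which is part of why this looks like a limit being hit rather than a missing or unexecutable file.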
I've offloaded half the checks to another system that submits passive results over nsca, but that didn't help either. I've played with a lot of tuning settings like limiting concurrent checks, spacing out an aggressively tuned check schedule, and generally just screwing with stuff, but nothing's worked, and I'm wondering if someone's run into this sort of thing before, and might be able to point me at something I haven't tried yet.

For the record, there's no SELinux involved, and nothing unusual in the system logs.

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
