I've recently begun an effort to move our Nagios installation from a centralized architecture to a distributed one. I had previously used NSCA for only a few passive checks, and it works fine on a 32-bit Red Hat AS 3 platform (the centralized server).
In testing on a distributed architecture (which is 64-bit Suse Linux Enterprise Server (SLES) 10), I seem to have a problem with NSCA. (Note that all Nagios and NSCA binaries and libraries were recompiled on the 64-bit platform.) After I broke out all the checks so that 2 separate distributed nodes send to a central server, I saw a few messages like this one in the nagios.log file:

    [1200583727] Warning: Passive check result was received for service '0' on host 'HOSTXXX', but the service could not be found!

Only about 1 out of every 10 or maybe 20 results was doing this. That is, the rest of the results were being correctly shown as "EXTERNAL COMMAND" and all expected NSCA fields came up correctly (hostname, service desc, check result, text output).

I started having the "send_nsca" script on the distributed nodes log what it was sending to a file. When I correlate what the nodes are sending with what the NSCA daemon thinks it's receiving, the client is still sending the correct 4 fields, but it's as if the NSCA daemon is dropping the 2nd field (service desc) and replacing it with the check result field. So ultimately, it thinks the service name is '0'.

I can't see that this matches a pattern (i.e. always on the same hosts or the same service checks). All I've seen so far is that it happens whether I run NSCA as --single or --daemon. It also happens even if I turn off one of the distributed nodes (that is, I can't see it being volume related).

I have turned on debugging in the NSCA daemon to see what it thinks it's getting, and it echoes what the nagios.log shows:

    SERVICE CHECK -> Host Name: 'HOSTXXX', Service Description: '0', Return Code: '0', Output: ' rta=0.140000 ms)'

Again, this is maybe only 1 out of 10. Ultimately, this causes the server to run an active check, as it thinks it never got a result from the distributed node.

I'm still trying to dig deeper, but it seems to me that this is increasingly pointing to some issue with 64-bit SLES.
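For anyone who wants to repeat the correlation I described, here is a rough Python sketch of how it could be scripted. It is not what I actually ran. It assumes the sender-side log holds exactly the tab-delimited lines piped to send_nsca (host, service desc, return code, output), and that the daemon debug lines look like the one quoted above. The function names and the "PING" service in the usage note are made up for illustration.

```python
import re

# Matches the NSCA daemon debug line format quoted in this post.
DEBUG_RE = re.compile(
    r"SERVICE CHECK -> Host Name: '(?P<host>[^']*)', "
    r"Service Description: '(?P<service>[^']*)', "
    r"Return Code: '(?P<code>[^']*)', "
    r"Output: '(?P<output>.*)'"
)

# Nagios service return codes: OK, WARNING, CRITICAL, UNKNOWN.
RETURN_CODES = {"0", "1", "2", "3"}


def parse_sent_line(line):
    """Split one logged send_nsca input line into its 4 tab-separated fields.

    Returns None if the line does not have exactly 4 fields.
    (Assumption: the sender log stores the raw tab-delimited input.)
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    host, service, code, output = fields
    return {"host": host, "service": service, "code": code, "output": output}


def parse_debug_line(line):
    """Extract the fields from one daemon debug line, or None if no match."""
    match = DEBUG_RE.search(line)
    return match.groupdict() if match else None


def looks_shifted(received):
    """True when the service description looks like a return code.

    This is only a heuristic for the symptom described above: the daemon
    reports the check result where the service description should be.
    """
    return (received["service"] in RETURN_CODES
            and received["service"] == received["code"])


def candidates(received, sent_records):
    """Sent records for the same host and return code as a suspect result."""
    return [s for s in sent_records
            if s["host"] == received["host"] and s["code"] == received["code"]]
```

Feeding it the debug line quoted above flags the record as shifted, and candidates() then narrows the sender log to entries for 'HOSTXXX' with return code 0 (for example a hypothetical "PING" service), which can be compared by hand.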
It could also be that some variable type in the NSCA daemon is not quite right for 64-bit. It's hard to tell given the intermittent nature of the problem and the fact that I have yet to discover a pattern.

Has anyone seen anything like this before?

Thanks
Mark