>-----Original Message-----
>From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
>Sent: Thursday, January 24, 2008 3:33 AM
>To: Frost, Mark {PBG}
>Cc: Nagios Users
>Subject: Re: [Nagios-users] Problem with high latencies after
>going distributed
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Some heavily broken indenting there (looks like my mail client gets
>confused)... don't trust the number of ">"!
>
>On 23/01/08 10:47 PM, Frost, Mark {PBG} wrote:
>>
>>> -----Original Message-----
>>> From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, January 23, 2008 10:24 PM
>>> To: Frost, Mark {PBG}
>>> Cc: Nagios Users
>>> Subject: Re: [Nagios-users] Problem with high latencies after
>>> going distributed
>>
>> I don't think so. I remember an email from Ton Voon some time ago
>> asking Ethan why the oc[hs]p commands are run serially, but I don't
>> recall if there was a reply or what else was said...
>>
>> I believe it's documented, either in the official doc or in some
>> user-contributed doc, that the oc[hs]p commands should return as
>> soon as possible. It's usually done in Perl using a fork:
>>
>>     if (fork==0) {
>>         # send stuff via NSCA here...
>>     }
>>     exit(0);
>>
>>> I guess what I'm thinking here is that unlike a custom check, I
>>> can't see most people needing to customize the passive check result
>>> process. All the solutions I've seen seem to include a named pipe.
>>> So why couldn't Nagios support making the ocsp/ochp "commands" just
>>> named pipes instead? Then instead of a standalone send_nsca binary,
>>> have the nsca source build a send_nscaD binary (I'm making that up)
>>> that reads from the pipe that nagios writes to and sends directly
>>> to nsca on the server. That sort of eliminates the middle-man in
>>> the process of reporting passive check results.
>>>
>>> I know, I know, I'm free to write the send_nscaD.c code and send it
>>> to Ethan :-)
>
>Well... I was thinking about partly re-writing nsca as an event-based
>daemon (supporting only the --single mode, but that would be really
>scalable) using libevent, allowing it to pass along the timestamp (this
>is a recent feature request) and supporting multi-line responses (for
>Nagios 3) in the process, and finally suggesting this as a base for an
>NSCA v3... I'm not even sure if I would have enough time, but since my
>main objective is to learn I wouldn't lose anything trying :).
>
>In the unlikely event that I write it, in the same step I could surely
>do a C version of OCP_Daemon natively supporting the "NSCA v3" protocol
>(it wouldn't be hard)...
>
>I'll have to think about it... I guess the only sane separator to write
>multiple multi-line results on a pipe would be \000 (NULL), so there
>would be 3 modes of operation for send_nsca (and two for nsca_sendd
>(don't you think it sounds better reversed?)):
>send_nsca: compatible (v2 behavior), single check (additional lines are
>taken as additional output) and multi-check (NULL-separated)
>nsca_sendd: single-line (one check/line, OCP_Daemon style) and
>multi-line (NULL-separated).
>
>> I don't know how many people use OCP_Daemon but I had reports from a
>> few people that greatly reduced their latency using it and I haven't
>> had any bug reported yet. I believe it's well documented as well, but
>> if you have any feedback on this I'll be happy to get it.
>>
>>> I'm playing with it a bit and have so far had good results. I'll
>>> have some feedback after I've played with it a bit longer. Thanks
>>> for writing it and writing up the docs for it as well!
>
>Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt for it ;)
>
>Thomas

I can see that using the OCP_Daemon script cut down on my latencies
quite a lot. Unfortunately, I'm still seeing some "stale" checks on the
master server that I can't explain.

I'm starting to get the feeling that going distributed isn't all it's
cracked up to be. I haven't seen any mention in the docs of the caveats
with oc[sh]p and latencies (my books sure don't mention them), and even
the submit_service_check script supplied in the distribution from Ethan
is a shell script that pipes to send_nsca. I'm not all that excited
about having to do a workaround for this issue.

While OCP_Daemon seems to help me, I'm a little uncomfortable running
it as a solution to our issue. First, we don't normally have root
access on our boxes, so recreating the FIFOs could be a problem (or at
least a wait). I'm also concerned about requiring another process
external to Nagios as part of the process. If OCP_Daemon dies at some
point, my distributed nodes are hosed. I had a few issues with starting
Nagios and OCP_Daemon in the right order when playing with it last
night. Once I got it all going, it worked well, but I'm thinking of
having to explain this to someone here who isn't the Nagios person.

I was thinking of your fork/exec comment above. What if one were to
rewrite the "glue" shell script (the one that takes the output from
Nagios and pipes it to send_nsca) to do something similar, but in C?
The parent would fork and exit (causing Nagios to think the oc[sh]p
command completed very quickly), and the child would go on and send the
output to send_nsca separately. For my setup, this has the advantage of
not being a separate process that I need to make sure continues to run.
It also doesn't require synchronizing listeners on both ends of a pipe,
without which one process would hang. It would almost be even better,
it seems to me, if this program could do the send_nsca functionality
itself (again, as the child) instead of even having to call send_nsca.
The biggest drawback I can see there is that you can't just edit a
compiled C program to set the destination server, etc. You'd just about
have to pile on a ton of command-line options or have a config file for
it. Just thinking out loud.

On a related note, I see that according to my performance stats, some
checks are still taking a very long time to run. Is there some easy way
I can see the execution time per check and track down which checks are
taking such a long time?

Thanks

Mark

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
