Hi Martin,

Thanks for the detailed description...

I've attached my monitrc file. Obviously the executable (logClient) is an in-house exe, but that shouldn't matter, should it? I wonder if it is due to the amount of time it takes for the exe to update its pid file... It seems as though it has something to do with wait_start starting its own thread to wait?

The scenario is rather simple. I can reproduce it by stopping the service, then issuing a monit <service> start via the CLI.

If this is not enough detail, or I can help out more, please let me know.

Thanks,
Aaron

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Pala
Sent: Wednesday, December 06, 2006 8:50 AM
To: The monit developer list
Subject: Re: <service> start Generates email noise

I have looked into it ... I will first explain how it works in monit 4.8.2:

Two threads come into play:
- http thread
- monitoring thread

The http thread processes the user-requested actions (posted using either the CLI or the HTML interface). The action to be done is scheduled in http/cervlet.c:handle_action() by setting the s->doaction flag for the appropriate service. When there is no action scheduled, the s->doaction flag is set to ACTION_IGNORE (in p.y during service initialization or in validate.c after it was handled). In addition, Run.doaction is set to TRUE just to signal that there is some scheduled action in the service tree. The main monitoring thread is then woken up by the http thread to speed up the action handling.

The main thread then checks in validate.c:validate() whether the Run.doaction flag is set, since user actions are preferred. If it is set, it walks the service tree and for each service performs the scheduled s->doaction using control_service() and then resets the s->doaction flag to ACTION_IGNORE. This is all done under mutex and signal protection, so it cannot be interrupted and no race condition can occur. The only thread which can call control_service and physically start/restart/etc. the service is the main thread. The control_service also sets the s->visited flag.

The second service loop is then evaluated: monit walks the service tree and, for each service, locks the mutex and blocks signals. If the service was not already handled in the same cycle (the s->visited flag is compared in check_skip), it checks the s->doaction flag again (to improve the response time for services which had an action scheduled between the first and second loop of the same cycle). If it is set, it performs the action; otherwise it checks the service.

The design is similar to signal handling. The http thread just sets the flag, whereas the monitoring thread handles the action. From a theoretical point of view, I think no race condition can occur.

I tried to reproduce the problem (official monit-4.8.2 release) without success. Can you prepare a simple monit configuration and a procedure for reproducing the problem?

Thanks,
Martin

Aaron Scamehorn wrote:
> Hi Martin,
>
> Actually I think you've now got one thread doing an ACTION_START, and
> another doing an ACTION_RESTART on the exact same service.
>
> It is the ACTION_RESTART that is generating what I perceived to be
> extraneous emails.
>
> It looks like the do_wakeupcall that you added to
> http/cervlet.c:handle_action() is the culprit. Without it, I don't get
> the ACTION_RESTART problem.
>
> Of course you need this now, or else it takes Poll Time to actually
> respond to the HTTP events, which is what you were trying to speed up in
> the first place.
>
> Here is the log output, with a bunch of extra messages, including
> pthread_t.
>
> 3086927552 [CST Dec 1 14:53:25] debug : 'data_dir' filesystem flags has not changed since last cycle
> 3086927552 [CST Dec 1 14:53:25] debug : 'data_dir' space usage check passed [current space usage=10.6%]
> 3086924720 [CST Dec 1 14:53:26] info : monit daemon at 24175 awakened
> 3086927552 [CST Dec 1 14:53:26] info : Awakened by User defined signal 1
> 3086927552 [CST Dec 1 14:53:26] debug : control_service: ACTION_START for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] debug : control_service: ACTION_START Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] debug : do_start: Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] info : 'LogClient' start: /cogcap/ccts/bin/logclnt
> 3086927552 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] debug : Monitoring enabled -- service LogClient
> 3086927552 [CST Dec 1 14:53:26] debug : check_process: calling Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] error : 'LogClient' process is not running
> 3086927552 [CST Dec 1 14:53:26] debug : Does not exist notification is NOT sent to [EMAIL PROTECTED]
> 3086927552 [CST Dec 1 14:53:26] debug : Does not exist notification is sent to [EMAIL PROTECTED]
> 3076434864 [CST Dec 1 14:53:26] debug : static void* wait_start for 'LogClient'
> 3076434864 [CST Dec 1 14:53:26] debug : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 29
> 3076434864 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] debug : control_service: ACTION_RESTART for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] info : 'LogClient' trying to restart
> 3086927552 [CST Dec 1 14:53:26] debug : Monitoring disabled -- service LogClient (stop)
> 3086927552 [CST Dec 1 14:53:26] debug : do_stop: Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:26] debug : 'LogClient' Error testing process id [24220] -- No such process
> 3086927552 [CST Dec 1 14:53:26] debug : 'data_dir' filesystem flags has not changed since last cycle
> 3086927552 [CST Dec 1 14:53:26] debug : 'data_dir' space usage check passed [current space usage=10.6%]
> 3076434864 [CST Dec 1 14:53:27] debug : 1) wait_start: calling Util_isProcessRunning for 'LogClient', max_tries= 28
> 3076434864 [CST Dec 1 14:53:27] debug : 2) wait_start: calling Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:56] debug : check_process: calling Util_isProcessRunning for 'LogClient'
> 3086927552 [CST Dec 1 14:53:56] info : 'LogClient' process is running with pid 24375
> 3086927552 [CST Dec 1 14:53:56] debug : Exists notification is NOT sent to [EMAIL PROTECTED]
> 3086927552 [CST Dec 1 14:53:56] debug : Exists notification is sent to [EMAIL PROTECTED]
> 3086927552 [CST Dec 1 14:53:56] debug : 'LogClient' zombie check passed [status_flag=0000]
> 3086927552 [CST Dec 1 14:53:56] debug : 'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
> 3086927552 [CST Dec 1 14:53:56] debug : 'LogClient' cpu usage check passed [current cpu usage=0.0%]
> 3086927552 [CST Dec 1 14:53:56] debug : 'LogClient' mem amount check passed [current mem amount=2764kB]
> 3086927552 [CST Dec 1 14:53:56] debug : 'data_dir' filesystem flags has not changed since last cycle
> 3086927552 [CST Dec 1 14:53:56] debug : 'data_dir' space usage check passed [current space usage=10.6%]
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Pala
> Sent: Thursday, November 30, 2006 4:20 PM
> To: The monit developer list
> Subject: Re: <service> start Generates email noise
>
> Hello,
>
> this behavior isn't a bug - the 'nonexist' event type has positive and
> negative variants:
>
> Does not exist (positive 'nonexist')
>
> vs.
>
> Exists (negative 'nonexist')
>
> The alert statement allows filtering on the general event type only, not
> on the particular polarity (there is no 'exist' option).
>
> => when you have registered the 'nonexist' event, you should get two
> alerts informing you about the beginning and end of the problem.
>
> Martin
>
>
> Aaron Scamehorn wrote:
>> Hello,
>>
>> From version 4.8 to 4.8.2, the following bug has been introduced:
>>
>> When we issue a monit <service> start command, we get a "Does not exist"
>> email and a corresponding "Exists" email.
>>
>> Here is the debug output showing this behavior in 4.8.2:
>> 'LogClient' Error testing process id [11034] -- No such process
>> 'LogClient' Error testing process id [11034] -- No such process
>> 'LogClient' start: /cogcap/ccts/bin/logclnt
>> 'LogClient' Error testing process id [11034] -- No such process
>> Monitoring enabled -- service LogClient
>> 'LogClient' Error testing process id [11034] -- No such process
>> 'LogClient' process is not running
>> Does not exist notification is sent to [EMAIL PROTECTED]
>> 'LogClient' Error testing process id [11034] -- No such process
>> 'LogClient' trying to restart
>> Monitoring disabled -- service LogClient (stop)
>> 'LogClient' Error testing process id [11034] -- No such process
>> 'LogClient' process is running with pid 11189
>> Exists notification is sent to [EMAIL PROTECTED]
>> 'LogClient' zombie check passed [status_flag=0000]
>> 'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.2]
>> 'LogClient' cpu usage check passed [current cpu usage=0.0%]
>> 'LogClient' mem amount check passed [current mem amount=2776kB]
>>
>>
>> Under version 4.8, we don't get the annoying "Does not exist" and
>> corresponding "Exists" emails:
>>
>> 'LogClient' Error testing process id [10970] -- No such process
>> 'LogClient' Error testing process id [10970] -- No such process
>> 'LogClient' start: /cogcap/ccts/bin/logclnt
>> 'LogClient' Error testing process id [10970] -- No such process
>> Monitoring enabled -- service LogClient
>> 'LogClient' Error testing process id [10970] -- No such process
>> 'LogClient' Error testing process id [10970] -- No such process
>> 'LogClient' zombie check passed [status_flag=0000]
>> 'LogClient' loadavg(5min) check passed [current loadavg(5min)=0.1]
>> 'LogClient' cpu usage check passed [current cpu usage=0.0%]
>> 'LogClient' mem amount check passed [current mem amount=2776kB]
>>
>>
>> Additionally, in our config file, we have the following set:
>> set alert [EMAIL PROTECTED] only on { nonexist, exec, connection }
>>
>> We shouldn't be getting an "Exists" email under any circumstance,
>> should we?
>>
>> Thanks,
>> Aaron
>>
>> _______________________________________________
>> monit-dev mailing list
>> [email protected]
>> http://lists.nongnu.org/mailman/listinfo/monit-dev
monitrc.punisher
Description: monitrc.punisher
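
As a rough illustration of the flag hand-off Martin describes above (the http thread only records the request, the monitoring thread carries it out), here is a minimal C sketch. It is not monit's source: the Service and Run structures, schedule_action(), validate_cycle() and the bookkeeping fields are hypothetical stand-ins, control_service() is reduced to a stub, and the points at which s->visited and Run.doaction are reset are assumptions made for the sketch. The mutex and signal protection are only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not monit's real definitions. */
typedef enum { ACTION_IGNORE, ACTION_START, ACTION_STOP, ACTION_RESTART } Action;

typedef struct Service {
    const char *name;
    Action doaction;       /* action scheduled by the http thread */
    int visited;           /* set once the service was handled in this cycle */
    int handled_count;     /* illustration only: number of actions performed */
    Action last_action;    /* illustration only: last action performed */
    struct Service *next;
} Service;

static struct { int doaction; } Run = { 0 };

/* Role of the http thread: record the request, do nothing else. */
static void schedule_action(Service *s, Action a) {
    assert(s != NULL);
    s->doaction = a;
    Run.doaction = 1;      /* "some action is pending somewhere in the tree" */
}

/* Stub for control_service(): only the monitoring thread calls it. */
static void control_service(Service *s, Action a) {
    s->last_action = a;
    s->handled_count++;
    s->visited = 1;
}

/* One monitoring cycle; returns how many services got an ordinary check. */
static int validate_cycle(Service *list) {
    int checked = 0;
    Service *s;

    /* Assumption: visited is cleared at the start of every cycle. */
    for (s = list; s != NULL; s = s->next)
        s->visited = 0;

    /* First loop: pending user actions are handled before anything else.
       In the real daemon this runs under mutex and signal protection. */
    if (Run.doaction) {
        for (s = list; s != NULL; s = s->next) {
            if (s->doaction != ACTION_IGNORE) {
                control_service(s, s->doaction);
                s->doaction = ACTION_IGNORE;
            }
        }
        Run.doaction = 0;  /* assumption: cleared once the tree was walked */
    }

    /* Second loop: skip services already handled in this cycle, pick up
       actions scheduled in the meantime, otherwise run the normal check. */
    for (s = list; s != NULL; s = s->next) {
        if (s->visited)
            continue;
        if (s->doaction != ACTION_IGNORE) {
            control_service(s, s->doaction);
            s->doaction = ACTION_IGNORE;
        } else {
            checked++;     /* stands in for the ordinary service check */
        }
    }
    return checked;
}
```

In this sketch, scheduling ACTION_START on one service and then running a cycle performs the action once in the first loop, and the second loop skips that service because visited is already set, so it gets no ordinary check in the same cycle.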
