[monit] monit race conditions on Mac OS X 10.5 Leopard?

Sergio Trejo Sat, 19 Jan 2008 02:22:09 -0800

Hello,

I have monit (version 4.10.1) running on an Apple machine which is Mac OS X
Server (Leopard, 10.5.1). My installation of monit monitors six separate
daemons for these programs: Apache, Postfix, PostgreSQL, Tomcat, OpenLDAP,
and MySQL. My monit configuration file has entries that look like this for
all of the six aforementioned programs (taking Apache for example):


check process apache with pidfile "/opt/local/apache2/logs/httpd.pid" every 10 cycles
    start = "/opt/local/apache2/bin/apachectl start"
    stop = "/opt/local/apache2/bin/apachectl stop"
    if failed port 80 and protocol http then restart
    if 5 restarts within 5 cycles then timeout

Where my daemon frequency is set to 60 seconds as in:

set daemon 60
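
For reference, here is a rough sketch of the timing these two settings
imply. It assumes that "every 10 cycles" means the apache check runs only on
every tenth poll cycle; the function name is made up for illustration and is
not part of monit.

```python
# Values taken from the configuration above.
DAEMON_INTERVAL_S = 60   # "set daemon 60": one poll cycle every 60 seconds
EVERY_N_CYCLES = 10      # "every 10 cycles" on the apache check


def effective_check_interval(daemon_interval_s: int, every_n_cycles: int) -> int:
    """Seconds between two tests of one service (hypothetical helper).

    Assumption: a service marked "every N cycles" is tested on every
    N-th poll cycle and skipped on the cycles in between.
    """
    return daemon_interval_s * every_n_cycles


interval = effective_check_interval(DAEMON_INTERVAL_S, EVERY_N_CYCLES)
print(f"apache is tested roughly every {interval} seconds "
      f"({interval // 60} minutes)")
```

If that assumption holds, a killed daemon may go unnoticed for up to about
ten minutes before monit tests it again.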

What is interesting is this: as a starting point I had all six of my daemons
running, and monit confirmed this (using the little http server built into
monit on port 2812). I then very intentionally (as a sort of auditing
process) killed five of my six daemons. The only daemon I left running was
Postfix, because I still wanted monit to be capable of sending email alerts,
since I use the mail server running on the same machine (Postfix), as in
"set mailserver 127.0.0.1". With five of the six daemons killed, monit did
later catch up and successfully restarted all five. However, monit generated
only two mail alerts:

1. A message stating that the apache daemon did not exist

2. A message stating that the postgres daemon did exist (seemed to have sent
this message after re-starting PostgreSQL)

But why didn't I receive ten messages: five stating that each daemon I
intentionally killed did not exist, and later five more stating that each of
those daemons (after being restarted) did indeed exist again?

Also, the first message said that apache did not exist, so shouldn't the
second message have said that the apache daemon existed again (instead of
telling me that the postgres daemon existed)?

It doesn't make sense. Is it possible that monit was "overwhelmed" or
overloaded in some way and became "confused"? I know that doesn't sound
appropriate for a binary system, but there is nothing in the monit log file
to give me any hints. Did monit perhaps experience a race condition?

The log file shows that all five daemons which I had manually killed were
restarted successfully (and indeed they were -- I ssh'ed into my server and
saw them all running again as processes and monit also reported their
successful running again on its http server on port 2812).

If this was a race condition, could there be an issue with threading? Mac OS
X 10.5 (Leopard and Leopard Server) might be different enough compared to
previous versions of Mac OS X with regard to a change to how threading works
(but I am writing this very vaguely without much information at the moment
other than some fuzzy recollection that something related to threading on
Leopard might have changed).

Thanks for any suggestions,

Serg

