Hello,

the start timeout only waits for the process itself to start: as soon as the process shows up in the process table, the start command is considered finished and testing resumes. The restart doesn't reset the error record, so the "5 times within 5 cycles" condition matches again immediately, because the failed cycles from before the restart are still counted.
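
To illustrate the counting behavior, here is a toy Python model. It is only a sketch, not Monit's actual code: the names `simulate`, `warmup_cycles` and `reset_on_restart` are hypothetical, and it assumes a service that is hung at first, needs three cycles after a start before the port answers, and a rule that fires when 5 of the last 5 cycles failed.

```python
from collections import deque


def simulate(total_cycles, warmup_cycles, reset_on_restart, times=5, within=5):
    """Return the cycle numbers at which the (modelled) restart action fires."""
    # Results of the last `within` cycles; True means the port test failed.
    history = deque(maxlen=within)
    # None means the process is hung and will never answer until restarted.
    cycles_since_start = None
    restarts = []
    for cycle in range(1, total_cycles + 1):
        if cycles_since_start is not None:
            cycles_since_start += 1
        port_up = (cycles_since_start is not None
                   and cycles_since_start >= warmup_cycles)
        history.append(not port_up)
        if sum(history) >= times:
            # The rule matched: restart the service.
            restarts.append(cycle)
            cycles_since_start = 0
            if reset_on_restart:
                # Proposed behavior: forget the pre-restart failures.
                history.clear()
    return restarts


# Failures from before the restart stay in the window: restart on every cycle.
print(simulate(12, 3, reset_on_restart=False))  # [5, 6, 7, 8, 9, 10, 11, 12]
# Window cleared on restart: the service gets its three cycles to come up.
print(simulate(12, 3, reset_on_restart=True))   # [5]
```

In this model the restart loop only appears when the failure window survives the restart, which matches the pattern in the syslog excerpt quoted below.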

We will modify the restart command to reset the pre-restart error cycles. The timeout should also temporarily suppress errors from the same service's tests until it expires.

Regards,
Martin

On Aug 7, 2013, at 8:04 PM, David Paper <[email protected]> wrote:

> Greetings,
>
> I've dug through the monit docs, examples and changelog from 5.2.3 to 5.5.1,
> and I am unable to find a reference to this problem. Here is what I am
> seeing. Using Monit 5.2.3 on RedHat linux 5.4 86_x64 platform.
>
> I have a process that locks up due to out of memory (java) and monit tries to
> stop/start it. When I manually stop/start the process, monit waits the 180
> seconds before it begins testing, and can test successfully. The job works
> as defined. The process takes more than 2 minutes to come online and start
> listening for TCP requests. What doesn't work is that the monit restart
> functionality appears to immediately test the port 1 second after restart,
> again at 1 minute after restart, then sensing the process isn't working
> correctly, tries to restart it, and the sequence begins all over. If I
> didn't know better, I would say that Monit is ignoring the defined time/cycle
> settings on a restart.
>
> My monit job for this process looks like this:
>
> check process jboss-ssp with pidfile /var/run/jboss/jboss-sspnode.pid
>   start program = "/opt/jboss/bin/monit_run.sh -c sspnode -b 10.91.51.32
> -g ssp-io-lp1 -u 239.255.150.1 -Djboss.messaging.ServerPeerID=1"
>     as uid 349 and as gid 349 with timeout 180 seconds
>   stop program = "/bin/bash -c 'kill -9 `cat
> /var/run/jboss/jboss-sspnode.pid`'"
>     as uid 349 and as gid 349
>   if failed host 10.91.51.141 port 8080 for 5 times within 5 cycles then
> alert
>   if failed host 10.91.51.141 port 8080 for 5 times within 5 cycles then
> restart
>
> Here is my monitrc:
>
> set daemon 60 # check services at 1-minute intervals
>   with start delay 60 # optional: delay the first check by 1-minute
> set logfile syslog facility log_daemon
> set idfile /var/run/monit.id
> set statefile /var/run/monit.state
> set mailserver smartmail.mydomain.com, # primary mailserver
> set eventqueue
>   basedir /opt/monit/eventqueue #set the base directory where events will
> be stored
>   slots 100 # optionally limit the queue size
> set alert [email protected] # receive all alerts
> set httpd port 2812 and
>   use address localhost # only accept connection from localhost
>   allow localhost # allow localhost to connect to the server and
> include /opt/monit/monit.d/*
>
> The syslog messages that show monits behavior:
>
> Aug 7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a
> connection to INET[10.91.51.141:8080] via TCP
> Aug 7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
> Aug 7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
> Aug 7 04:02:27 stdeciovag1 monit[4111]: 'jboss-ssp' start:
> /opt/jboss/bin/monit_run.sh
> Aug 7 04:02:27 stdeciovag1 logger: Running /opt/jboss/bin/run.sh
> Aug 7 04:02:28 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a
> connection to INET[10.91.51.141:8080] via TCP
> Aug 7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a
> connection to INET[10.91.51.141:8080] via TCP
> Aug 7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
> Aug 7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
> Aug 7 04:03:29 stdeciovag1 monit[4111]: 'jboss-ssp' start:
> /opt/jboss/bin/monit_run.sh
> Aug 7 04:03:29 stdeciovag1 logger: Running /opt/DECE_jboss/bin/run.sh
> Aug 7 04:03:30 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a
> connection to INET[10.91.51.141:8080] via TCP
> Aug 7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a
> connection to INET[10.91.51.141:8080] via TCP
> Aug 7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
> Aug 7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
> Aug 7 04:04:31 stdeciovag1 monit[4111]: 'jboss-ssp' start:
> /opt/jboss/bin/monit_run.sh
> ….
>
> This goes on forever until someone manually intervenes and stops and starts
> the monit job manually.
>
> Any help/guidance would be appreciated.
>
> Thanks,
>
> -dave
>
> --
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general
