Monit triggering restart storm

Guillaume François Thu, 09 Nov 2017 03:09:13 -0800

Hi,

I have a set of Monit rules that run checks on a service:


   1. One check process rule (existence and port checks)
      1. does not exist for 5 cycles then start
      2. failed port XXXX for 6 times within 8 cycles then restart
      3. failed port YYYY for 6 times within 8 cycles then restart
      4. failed port ZZZZ for 6 times within 8 cycles then restart
   2. Three check program rules with custom checks
      1. if status != 0 for 5 times within 10 cycles then restart
      2. if status != 0 for 5 times within 10 cycles then restart
      3. if status != 0 for 5 times within 10 cycles then restart
   3. One rule to check log content
      1. check file + if content = "BIG ERROR" then restart

The start/stop rules are:

start program = "/bin/systemctl start myservice"
stop program = "/bin/systemctl stop myservice"

There is no dependency between the checks at the Monit level, but they all
belong to the same set of groups.
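
Put together, the setup looks roughly like the sketch below. This is only an illustration of the rules listed above, not my exact control file: the check names, the "matching" pattern, the script and log paths and the group name are placeholders, and XXXX/YYYY/ZZZZ stand for the real port numbers. I am also assuming here that every check carries the same start/stop pair, since "restart" needs one.

```
# Sketch only: names, paths, ports and group are placeholders.
check process myservice matching "myservice"
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if does not exist for 5 cycles then start
  if failed port XXXX for 6 times within 8 cycles then restart
  if failed port YYYY for 6 times within 8 cycles then restart
  if failed port ZZZZ for 6 times within 8 cycles then restart

# One of the three custom checks; the other two follow the same pattern.
check program myservice_check1 with path "/path/to/custom_check1.sh"
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if status != 0 for 5 times within 10 cycles then restart

# Log content check.
check file myservice_log with path "/path/to/myservice.log"
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if content = "BIG ERROR" then restart
```

Each check decides on its own when to restart, so nothing here coordinates the five independent restart triggers.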

The problem is that, because of several issues at once, I got a "restart"
storm:

   1. a port check failed -> restart issued
   2. this led to errors in the custom scripts -> restart issued
   3. reading the log content lags behind -> restart issued

Either MyService or the systemd configuration is not well designed, so I got
an "already bind exception" as systemd tried to start several instances at
the same time 🤔

So the port check failed again, systemd killed the wrong instance, MyService
was blocked, another restart followed, and so on.

I had to shut down Monit to prevent further actions (I could also have used
monit -g group unmonitor), kill every instance of my service, start it
correctly, and then reactivate Monit.


Question:

   - Is there a native way to prevent Monit from issuing the same start/stop
   commands within a defined time frame?
   - Could Monit's dependency feature between checks help? I don't see
   how it would.
   - Any other hints or proposals (aside from increasing the values of
   "for N times within T cycles" to delay the restart)?

Remark: exploring the systemd settings StartLimitIntervalSec and
StartLimitBurst might also help.


Best Regards.

