Re: Monit triggering restart storm

[email protected] Thu, 09 Nov 2017 04:16:12 -0800

Hi,

If the start/stop methods are the same ("/bin/systemctl [start|stop] myservice"), 
then the solution is to make every 'check program' and 'check file' depend on 
the parent 'check process'.


If the dependent checks need to restart the parent process, they should do so 
via "... exec /usr/bin/monit restart myprocess". Unfortunately the exec with the 
monit CLI is necessary, as there is currently no direct start/stop/restart 
action that can pass the action to another check by name.

If the parent process fails (for example the process is not running or a port 
check failed), the dependent checks will be aware of the parent restart and 
won't trigger another restart.


Example:

--8<--
check process myprocess matching "foobar"
    start program = "/bin/systemctl start myservice"
    stop program = "/bin/systemctl stop myservice"
    if does not exist for 5 cycles then start
    if failed port XXXX for 6 times within 8 cycles then restart
    if failed port YYYY for 6 times within 8 cycles then restart
    if failed port ZZZZ for 6 times within 8 cycles then restart

check program myprocess_collector with path "/usr/bin/collect_report_from_myprocess.sh"
    if status != 0 for 5 times within 10 cycles then exec "/usr/bin/monit restart myprocess"
    depends on myprocess

....

# Assumption: the log path below is a placeholder; the real path was not given
check file myprocess_log with path "/path/to/myprocess.log"
    if content = "BIG ERROR" then exec "/usr/bin/monit restart myprocess"
    depends on myprocess
--8<--
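
On the question of a native limit: Monit also has a restart-limit test that 
may help cap a storm. A minimal sketch, assuming the same check as above; the 
thresholds 3 and 15 are illustrative values, not a recommendation, and it is 
worth verifying with your Monit version whether restarts issued through the 
"exec /usr/bin/monit restart ..." route count toward this limit:

--8<--
check process myprocess matching "foobar"
    start program = "/bin/systemctl start myservice"
    stop program = "/bin/systemctl stop myservice"
    if does not exist for 5 cycles then start
    # Assumption: 3 restarts within 15 cycles is an example threshold only
    if 3 restarts within 15 cycles then unmonitor
--8<--

Once the limit is hit, the check stays unmonitored until it is re-enabled, for 
example with "/usr/bin/monit monitor myprocess".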


Best regards,
Martin



> On 9 Nov 2017, at 12:07, Guillaume François <[email protected]> 
> wrote:
> 
> Hi,
> 
> I have a bunch of Monit rules to perform checks on a service:
> 
> - One check process rule (existence and port checks):
>     if does not exist for 5 cycles then start
>     if failed port XXXX for 6 times within 8 cycles then restart
>     if failed port YYYY for 6 times within 8 cycles then restart
>     if failed port ZZZZ for 6 times within 8 cycles then restart
> - Three check program rules with custom checks:
>     if status != 0 for 5 times within 10 cycles then restart
>     if status != 0 for 5 times within 10 cycles then restart
>     if status != 0 for 5 times within 10 cycles then restart
> - One check file rule to check log content:
>     if content = "BIG ERROR" then restart
> 
> The start/stop rules are:
> 
>       start program = "/bin/systemctl start myservice"
>       stop program = "/bin/systemctl stop myservice"
> 
> There is no dependency at the Monit level, but the checks are part of the 
> same set of groups.
> 
> The problem is that, due to multiple issues, I got a "restart" storm:
> - some port check failed -> restart issued
> - this led to an error in the custom script -> restart issued
> - log content reading lags behind -> restart issued
> - MyService or the systemd configuration is not well designed, so I got an 
>   "already bind exception" as systemd tried to start several instances at 
>   the same time 🤔
> 
> So the port check failed again, systemd killed the wrong instance, MyService 
> was blocked, restart again, etc.
> 
> I had to shut down Monit to prevent further actions (I could also have used 
> "monit -g group unmonitor"), kill every instance of my service, start it 
> correctly, then reactivate Monit.
> 
> 
> Questions:
> - Is there a native way to prevent Monit from issuing the same start/stop 
>   commands within a defined time frame?
> - Could Monit's dependency feature between checks help? I don't see how it 
>   would.
> - Any other hints or proposals (aside from increasing the values of "for N 
>   times within T cycles" to delay the restart)?
> 
> Remark: maybe exploring the systemd settings StartLimitIntervalSec and 
> StartLimitBurst could help.
> 
> 
> Best Regards.
> -- 
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general
