Hi,
Let me take a moment to describe what I'm trying to do, in case my tack
is all wrong.
We have several systems that process data for users. The programs the
users run all live in a shared space and run in user space at the users'
discretion. I would like to use monit to alert when one of these
processes is started and to track its memory and CPU usage, alerting
further when the CPU or memory of that process exceeds a certain
threshold (and possibly renicing it via some script).
I've currently set up alerts like this:
check process process1
    matching "process1"
    mode passive
    group processing
    if cpu is greater than 90% for 5 cycles then alert
    if memory is greater than 90% for 5 cycles then alert

check process process2
    matching "process2"
    mode passive
    group processing
    if cpu is greater than 90% for 5 cycles then alert
    if memory is greater than 90% for 5 cycles then alert

check process process3
    matching "process3"
    mode passive
    group processing
    if cpu is greater than 90% for 5 cycles then alert
    if memory is greater than 90% for 5 cycles then alert
...and it goes on like that for another dozen or so processes.
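Since every stanza differs only in the process name, the repetition could at least be generated rather than typed. A minimal sketch, assuming the names and thresholds shown above; PROCESS_NAMES, render_stanza and render_config are hypothetical names of my own, not anything monit provides:

```python
# Hypothetical generator for the repetitive "check process" stanzas.
# The names below are the placeholders used in this mail.
PROCESS_NAMES = ["process1", "process2", "process3"]


def render_stanza(name, cpu=90, mem=90, cycles=5):
    """Return one monit stanza for a single process name.

    The name is reused verbatim as the "matching" pattern, which monit
    treats as a regular expression, so names containing regex
    metacharacters would need escaping first.
    """
    return (
        f"check process {name}\n"
        f'    matching "{name}"\n'
        "    mode passive\n"
        "    group processing\n"
        f"    if cpu is greater than {cpu}% for {cycles} cycles then alert\n"
        f"    if memory is greater than {mem}% for {cycles} cycles then alert\n"
    )


def render_config(names):
    """Join the stanzas for all names, separated by blank lines."""
    return "\n".join(render_stanza(name) for name in names)


if __name__ == "__main__":
    # Redirect the output into a file that monitrc pulls in via "include".
    print(render_config(PROCESS_NAMES), end="")
```

This only removes the typing, of course; it does nothing about the "not running" noise described below.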
This "works" but is not ideal.
What would be ideal is something more along the lines of:
check process process1
    matching "process1"
    alert on statechange   (basically, ignore the fact that this process is
        not running, but let me know when it starts and ends [i.e. alert on
        a state change], and monitor it while it is running)
    mode passive
    group processing
    if cpu is greater than 90% for 5 cycles then alert
    if memory is greater than 90% for 5 cycles then alert
Also, we are using M/Monit, and every process on every machine that is NOT
running shows up as a hit against overall health,
i.e. under the host status:
Status 10 out of 27 services are available
and on the main dashboard:
[Sep 16 2013 15:59:47] Host myhost.example.com
<https://im-on-it.crbs.ucsd.edu/status/hosts/detail?id=1656> reported a
problem with process1: process is not running
[Sep 16 2013 15:59:44] Host myhost.example.com
<https://im-on-it.crbs.ucsd.edu/status/hosts/detail?id=1656> reported a
problem with process2: process is not running
[Sep 16 2013 15:59:40] Host myhost.example.com
<https://im-on-it.crbs.ucsd.edu/status/hosts/detail?id=1656> reported a
problem with process3: process is not running
[Sep 16 2013 15:59:35] Host myhost.example.com
<https://im-on-it.crbs.ucsd.edu/status/hosts/detail?id=1656> reported a
problem with process4: process is not running
Multiply that by 20+ hosts and you get the idea.
The fact that a process isn't running is never a problem. I would like
to reflect that somehow, and also to have some insight into what's
running where.
Another thing I would really like to be able to do is pass the process
arguments into the alert emails,
i.e. when the command process1 -t foo -o bar -cfg process1.cfg -v -X -s
is run, I'd be tickled if I could get "-t foo -o bar -cfg process1.cfg
-v -X -s" (or even the entire output of monit procmatch) into the
alert somehow.
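For what it's worth, outside of monit the arguments themselves are easy to get at on Linux, where /proc/<pid>/cmdline holds the argument vector separated by NUL bytes. A minimal sketch, assuming that /proc layout; parse_cmdline, format_args and args_of are hypothetical helper names of my own, and I don't know of a way to feed the result into monit's alert text, which is really what I'm asking for:

```python
from pathlib import Path


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into a list."""
    if not raw:
        return []  # e.g. kernel threads expose an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminating NUL, keep empty arguments
    return [part.decode(errors="replace") for part in raw.split(b"\0")]


def format_args(raw):
    """Return everything after argv[0] as a single space-joined string."""
    return " ".join(parse_cmdline(raw)[1:])


def args_of(pid):
    """Read and format the arguments of a running process (Linux only)."""
    return format_args(Path(f"/proc/{pid}/cmdline").read_bytes())


# Example with the command line from this mail, no /proc access needed:
# format_args(b"process1\0-t\0foo\0-o\0bar\0") gives "-t foo -o bar"
```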
I've only had this up and running for about a month, and monit has
already saved my bacon on filesystem checks and dead services several
times. I just want to do a bit more with it than the system side of things.