--On October 13, 2006 9:41:52 AM -0400 Bill Chmura <[EMAIL PROTECTED]> 
wrote:

>
> Hello,
>
> Yesterday I installed two temperature sensors in my server room.  I set
> them both for 10 degrees higher than the current.  Well, the building
> people raise the temperature up at night to save on energy.
>
> I do have my own cooling system in there, but it did not compensate for
> the building raising and set off the alarms.
>
> My threshold was for 75 degrees and the peak it went up to was 76.
> Unfortunately it paged me around 75 times last night.
>

Ah, I believe you've just learned the first lesson of monitoring...  Never 
enable paging on a new test/service until you've run the monitoring test 
for a while first.



> So it basically went like this:
>
> ALERT (temp 75.7)
> UPALERT (temp 75.3)
> ALERT (temp 75.4)
> UPALERT (temp 75.6)
> etc, etc...
>
> All of these are above the stated MAX limit of 75.  For some reason,
> every other one is coming as good news - even though the temperature
> could have gone up.
>
>
> I am going to spend part of today ensuring I can sleep tonight (first by
> raising the MAX temp) by solving this - but if anyone has any thoughts
> on this - I would love to hear them.


I have a suspicion of what's going on here.  I believe the current mon 
version has a feature (or bug, depending on your point of view) where the 
UPALERT summary & detail messages are actually from the last failure, not 
from the OK test.  I suspect the temperature was actually crossing the 
threshold repeatedly, something like this:

test 1:  75.7 -> ALERT 75.7
test 2: 75.3 (no alert)
test 3: <75 -> UPALERT 75.3
etc...
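
To make that sequence concrete, here is a minimal Python sketch (not mon's
actual code) of a scheduler that behaves the way I suspect.  It assumes two
things: that repeated failures after the first ALERT are not re-sent, and
that an UPALERT reuses the detail text of the most recent failing test.
The function name and the readings are made up for illustration.

```python
THRESHOLD = 75.0  # MAX temperature from the original post


def simulate(readings, threshold=THRESHOLD):
    """Return the messages a mon-like scheduler might send.

    Assumption: an UPALERT carries the reading from the most recent
    FAILED test, not the reading from the test that passed.
    """
    messages = []
    alerting = False     # True once an ALERT has been sent
    last_failure = None  # reading from the most recent failed test
    for temp in readings:
        if temp > threshold:
            last_failure = temp
            if not alerting:
                messages.append(f"ALERT (temp {temp})")
                alerting = True
        elif alerting:
            # The test passed, but the message shows the old failure value.
            messages.append(f"UPALERT (temp {last_failure})")
            alerting = False
    return messages


# Hypothetical readings that hover around the threshold.
for line in simulate([75.7, 75.3, 74.8, 75.4, 75.6, 74.9]):
    print(line)
```

With those readings the sketch produces ALERT 75.7, UPALERT 75.3, ALERT
75.4, UPALERT 75.6, which matches the pattern in the pages you received.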

There has been debate in the past about whether providing the 'last 
failure' content is useful for indicating what failure ended, or is 
confusing because it looks like it's saying that state is OK.  I feel it's 
confusing, and at CMU we're running with a patched mon that provides the 
success output during an upalert.

I can't remember right now whether a decision was made about changing this 
behavior.  If we decided to change it, the change must have gotten missed 
during one of the big merges between Jim's alert structure rewrites and my 
behavior changes.

So, the messages you got were confusing, but the temperature was probably 
crossing your threshold repeatedly.  You might want to experiment with 
requiring several consecutive failures before you alert, e.g. 
'alertafter 3'.  Or you could de-bounce the monitor test somehow.  Maybe 
configure it with two values, a low-water mark and a high-water mark, and 
exit with different exit codes, e.g. use exit code 1 when the temperature 
is 75-78 and exit code 2 when it is over 78.  Then you could send only 
email for temperatures in the 75-78 range, and page for temps over 78.
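
A rough Python sketch of the two-level idea, in case it helps.  The
function name, the constants, and the sample readings are all made up; how
your monitor actually reads the sensor is up to you.

```python
WARN_AT = 75.0  # low-water mark: email only
PAGE_AT = 78.0  # high-water mark: page


def exit_code_for(temp, warn_at=WARN_AT, page_at=PAGE_AT):
    """Map a temperature reading to a monitor exit code.

    0 = OK, 1 = warning (75-78), 2 = critical (over 78).
    """
    if temp > page_at:
        return 2
    if temp >= warn_at:
        return 1
    return 0


# Hypothetical readings, just to show the mapping.
for reading in (74.2, 76.0, 79.5):
    print(reading, "->", exit_code_for(reading))
```

A real monitor script would read the sensor, print the reading as its
summary line, and pass the result to sys.exit().  The alert definitions
would then need to select on exit status so that only code 2 pages.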

-David



