David Nolan wrote:
> 
> --On October 13, 2006 9:41:52 AM -0400 Bill Chmura <[EMAIL PROTECTED]> 
> wrote:
> 
>> Hello,
>>
>> Yesterday I installed two temperature sensors in my server room.  I set
>> them both for 10 degrees higher than the current temperature.  Well, the building
>> people raise the temperature up at night to save on energy.
>>
>> I do have my own cooling system in there, but it did not compensate for
>> the building raising and set off the alarms.
>>
>> My threshold was for 75 degrees and the peak it went up to was 76.
>> Unfortunately it paged me around 75 times last night.
>>
> 
> Ah, I believe you've just learned the first lesson of monitoring...  Never 
> enable paging on a new test/service until you've run the monitoring test 
> for a while first.
> 
> 
> 
>> So it basically went like this:
>>
>> ALERT (temp 75.7)
>> UPALERT (temp 75.3)
>> ALERT (temp 75.4)
>> UPALERT (temp 75.6)
>> etc, etc...
>>
>> All of these are above the stated MAX limit of 75.  For some reason,
>> every other one is coming through as good news - even though the
>> temperature could have gone up.
>>
>>
>> I am going to spend part of today ensuring I can sleep tonight (first by
>> raising the MAX temp) by solving this - but if anyone has any thoughts
>> on this - I would love to hear them.
> 
> 
> I have a suspicion of what's going on here.  I believe the current mon 
> version has a feature (or bug, depending on your point of view) where the 
> UPALERT summary & detail messages are actually from the last failure, not 
> from the OK test.  I suspect the temperature was actually crossing the 
> threshold repeatedly, something like this:
> 
> test 1:  75.7 -> ALERT 75.7
> test 2: 75.3 (no alert)
> test 3: <75 -> UPALERT 75.3
> etc...
> 
> There has been debate in the past about whether providing the 'last 
> failure' content is useful for indicating what failure ended, or is 
> confusing because it looks like it's saying that state is OK.  I feel it's 
> confusing, and at CMU we're running with a patched mon that provides the 
> success output during an upalert.
> 
> I can't remember right now whether a decision was made about changing this 
> behavior.  If we decided to change it, the change must have gotten missed 
> during one of the big merges between Jim's alert structure rewrites and my 
> behavior changes.
> 
> So, the messages you got were confusing, but the temperature was probably 
> crossing your threshold repeatedly.  You might want to experiment with 
> putting a longer threshold in place before you alert, i.e. 'alertafter 3'. 
> Or you could de-bounce the monitor test somehow.  Maybe configure it with 
> two values, a low-water mark and a high-water mark, and exit with different 
> exit codes, e.g. exit code 1 when the temperature is 75-78 and exit code 2 
> when it is over 78.  Then you could send only email for temperatures 
> in the 75-78 range, and page for temperatures over 78.
> 
> -David
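
[A minimal sketch of the two-exit-code idea David describes above. The
function name, the default marks, and the choice to treat exactly 75 as
a warning are assumptions for illustration; reading the sensor and
wiring each exit code to a mail or page alert in mon is left out.]

```python
import sys

def exit_code_for(temp, low_mark=75.0, high_mark=78.0):
    """Map a temperature to a monitor exit code (hypothetical helper).

    0 = OK, 1 = warning band (low_mark..high_mark), 2 = over high_mark.
    """
    if temp > high_mark:
        return 2
    if temp >= low_mark:
        return 1
    return 0

if __name__ == "__main__":
    # The reading is passed on the command line here only to keep the
    # sketch self-contained; a real monitor would query the sensor.
    reading = float(sys.argv[1]) if len(sys.argv) > 1 else 75.7
    code = exit_code_for(reading)
    print(f"temp {reading}")
    sys.exit(code)
```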

Yeah, the second lesson is to never ever enable paging :)

I did run it for all of yesterday, just not through the surprise climate 
change that happens at night here apparently.

My first thought was the threshold, but the "last error" threw me 
somewhat.

Well, I would have to agree with your idea there...  I've traced through 
all the code for that monitor and have not found any issues.  I've also 
not been able to replicate it today (which just pisses me off all the 
more).  But that was before I considered that it might be crossing the 
threshold.

I set the alert temperature to right around where it is now, and 
hopefully in a little while I will have replicated the problem...

<a little while passes>

Well, it seems to be doing the same thing, bouncing around.  So thank 
you for whacking me on the head with that.
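
[A small simulation of the behavior David suspected, as a sketch only.
It assumes one alert per run of failures and an up-alert that reuses
the summary of the last failing test. The readings below 75 are made
up for illustration; the others are taken from the messages quoted
above.]

```python
def simulate(readings, limit=75.0):
    """Return the messages sent, with up-alerts showing the last failure."""
    messages = []
    failing = False
    last_failure = None
    for temp in readings:
        if temp > limit:
            last_failure = temp          # remember the most recent bad reading
            if not failing:
                messages.append(f"ALERT (temp {temp})")
                failing = True
        elif failing:
            # The test passed, but the message carries the stale failure value.
            messages.append(f"UPALERT (temp {last_failure})")
            failing = False
    return messages

for line in simulate([75.7, 75.3, 74.8, 75.4, 75.6, 74.9]):
    print(line)
```

[Under these assumptions the output is the same ALERT 75.7, UPALERT
75.3, ALERT 75.4, UPALERT 75.6 sequence as in the original report.]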

I am thinking the solution for this is to set my alerts a bit higher, 
and set the alertafter a bit higher to account for it.  But not much 
higher since they should no longer be on the "verge" - even at night 
during climate control.
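
[A rough sketch of why either change should help. It counts how many
alerts a series of readings would trigger when an alert fires only
after a given number of consecutive failures. This is loosely modeled
on the 'alertafter 3' suggestion, not on mon's actual implementation,
and the readings and the 77-degree limit are invented for the example.]

```python
def count_alerts(readings, limit, alert_after=1):
    """Count alerts when one fires on the Nth consecutive failure."""
    alerts = 0
    streak = 0
    for temp in readings:
        if temp > limit:
            streak += 1
            if streak == alert_after:
                alerts += 1
        else:
            streak = 0                   # any passing test resets the count
    return alerts

# Hypothetical readings hovering around a 75-degree threshold.
readings = [75.7, 75.3, 74.8, 75.4, 75.6, 74.9, 75.2, 74.7]

print(count_alerts(readings, limit=75.0))                 # alert on first failure
print(count_alerts(readings, limit=75.0, alert_after=3))  # require 3 in a row
print(count_alerts(readings, limit=77.0))                 # raise the limit instead
```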

Hey, thanks for the help - I appreciate it.





_______________________________________________
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
