--On October 13, 2006 9:41:52 AM -0400 Bill Chmura <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> Yesterday I installed two temperature sensors in my server room.  I set
> them both for 10 degrees higher than the current.  Well, the building
> people raise the temperature up at night to save on energy.
>
> I do have my own cooling system in there, but it did not compensate for
> the building raising and set off the alarms.
>
> My threshold was for 75 degrees and the peak it went up to was 76.
> Unfortunately it paged me around 75 times last night.
>

Ah, I believe you've just learned the first lesson of monitoring...  Never enable paging on a new test/service until you've run the monitoring test for a while first.

> So it basically went like this:
>
> ALERT (temp 75.7)
> UPALERT (temp 75.3)
> ALERT (temp 75.4)
> UPALERT (temp 75.6)
> etc, etc...
>
> All of these are above the stated MAX limit of 75.  For some reason,
> every other one is coming as good news - even though the temperature
> could have gone up.
>
>
> I am going to spend part of today ensuring I can sleep tonight (first by
> raising the MAX temp) by solving this - but if anyone has any thoughts
> on this - I would love to hear them.

I have a suspicion of what's going on here.  I believe the current mon version has a feature (or bug, depending on your point of view) where the UPALERT summary & detail messages are actually from the last failure, not from the OK test.  I suspect the temperature was actually crossing the threshold repeatedly, something like this:

  test 1: 75.7  -> ALERT 75.7
  test 2: 75.3  (no alert)
  test 3: <75   -> UPALERT 75.3
  etc...

There has been debate in the past about whether providing the 'last failure' content is useful for indicating what failure ended, or is confusing because it looks like it's saying that state is OK.  I feel it's confusing, and at CMU we're running with a patched mon that provides the success output during an upalert.  I can't remember right now whether a decision was made about changing this behavior.
If we decided to change it, the change must have gotten missed during one of the big merges between Jim's alert structure rewrites and my behavior changes.

So, the messages you got were confusing, but the temperature was probably crossing your threshold repeatedly.  You might want to experiment with requiring several consecutive failures before you alert, e.g. 'alertafter 3'.  Or you could de-bounce the monitor test somehow.  Maybe configure it with two values, a low-water mark and a high-water mark, and exit with different exit codes.  E.g. use exit code 1 when the temperature is 75-78, and exit code 2 when the temperature is over 78.  Then you could send only email for temperatures in the 75-78 range, and page for temperatures over 78.

-David

_______________________________________________
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
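
For illustration only, the low-water/high-water idea above could be sketched roughly as follows.  This is a minimal sketch, not an existing mon monitor: the names classify, LOW_WATER and HIGH_WATER are hypothetical, and treating a reading of exactly 75 as a warning is an assumption.

```python
# Hypothetical two-threshold check; none of these names come from mon itself.
LOW_WATER = 75.0   # assumed: readings at or above this count as a warning
HIGH_WATER = 78.0  # readings above this count as critical


def classify(temp, low=LOW_WATER, high=HIGH_WATER):
    """Map a temperature reading to a monitor exit code.

    0 -> OK (below the low-water mark)
    1 -> warning (between the low- and high-water marks): email only
    2 -> critical (above the high-water mark): page someone
    """
    if temp > high:
        return 2
    if temp >= low:
        return 1
    return 0


def summary(temp):
    """Build a one-line summary such as 'temp 75.7' for the alert text."""
    return f"temp {temp:.1f}"
```

A real monitor script would read the sensor, print summary(reading), and finish with sys.exit(classify(reading)) so that the exit code carries the severity.  How the sensor is read is left out here because it depends on the hardware.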