Here's how the IPMI watchdog should work:
1) configure it for action, timeout, etc.
2) start the watchdog timer
3) some application (ipmitool raw, ipmiutil wdt, etc.) or driver
(openipmi) restarts the watchdog timer before the timeout, iteratively.
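
For illustration, step 3 can be sketched in Python as below. This is a minimal sketch, not what any of the tools above actually do. It assumes the standard Linux watchdog character-device semantics, where any write counts as a keepalive and writing 'V' just before close requests a "magic close". The names pet_watchdog and disarm are made up for the example, and the device is passed in as an already opened file object so the loop can be tried against an ordinary file first.

```python
import time


def pet_watchdog(dev, interval_s, count, sleep=time.sleep):
    """Write one keepalive byte to `dev`, `count` times, `interval_s` apart.

    `dev` is any binary file-like object. On a real system it would be the
    watchdog device opened for writing (assumption: Linux watchdog API).
    """
    for _ in range(count):
        dev.write(b"\0")   # any write restarts the countdown
        dev.flush()        # make sure the byte reaches the driver now
        sleep(interval_s)  # must stay well below the configured timeout


def disarm(dev):
    """Request a 'magic close' so the timer stops when the device is closed.

    Assumption: the driver honours 'V' and was not loaded with nowayout.
    """
    dev.write(b"V")
    dev.flush()
```

On a real machine the device would be opened with something like open("/dev/watchdog", "wb", buffering=0). Opening it normally arms the timer, so only try that on a box you can afford to have reset.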


The most common problem case is that steps 1 and 2 are done but step 3
is not, so the timer expires and the system resets every time.
It could also be a Supermicro firmware bug, but filing a firmware bug
report would require more detail about the test case.
How are you performing these steps?

Andy

-----Original Message-----
From: Kelsey Cummings [mailto:k...@corp.sonic.net] 
Sent: Wednesday, July 20, 2011 5:31 PM
To: ipmitool-devel@lists.sourceforge.net
Subject: [Ipmitool-devel] supermicro ipmi watchdog issues

We have a series of Supermicro systems w/IPMI running RHEL 5.5.  We're 
using IPMI primarily to monitor PSU status and for the hardware watchdog 
support teamed with the watchdog service.  3 or 4 out of 8 identical 
systems have exhibited hardware-watchdog-triggered resets for no 
apparent reason.

As best we can tell, despite the OS and hardware being perfectly healthy 
(no other errors, and the systems work fine after the watchdog is 
disabled), the hardware watchdog is triggering a reset on its own, and 
worse, the boxes do not appear to come back from it.

Has anyone else seen a similar issue, or have any input?

So far Supermicro has suggested we disable the hardware watchdog...

/etc/watchdog.conf contains the minimal config:
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
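
A quick way to sanity-check a config like this against the BMC timeout is sketched below. It is only a sketch: the parser handles just the simple "key = value" lines shown here, the function names are invented for the example, and the hardware timeout (hw_timeout_s) is something you would have to read from the BMC yourself, since it is not part of watchdog.conf and is not given in this message.

```python
def parse_watchdog_conf(text):
    """Parse simple 'key = value' lines, ignoring blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def interval_is_safe(settings, hw_timeout_s, margin_s=0):
    """True if the keepalive interval (plus a margin) is below the timeout.

    hw_timeout_s is a hypothetical input: the timeout actually programmed
    into the hardware watchdog, in seconds.
    """
    interval_s = int(settings["interval"])
    return interval_s + margin_s < hw_timeout_s
```

With the config above, interval_is_safe(settings, 60) holds, while a hardware timeout of 10 seconds or less would leave no room for the keepalive to arrive in time.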

The SEL follows.  The Power Supply events are due to us remotely 
resetting the power; of course, all of these systems are at remote POPs.

   3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
   3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
   3e | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   3f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   40 | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   41 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   42 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   43 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   44 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   45 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   46 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   47 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   48 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   49 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   4a | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
   4c | 07/02/2011 | 21:13:39 | Power Supply #0x14 | State Asserted
   4d | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   4e | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   4f | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   50 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   51 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   52 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   53 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   54 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   55 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   56 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   57 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   58 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   59 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
   5b | 07/02/2011 | 21:26:55 | Power Supply #0x14 | State Asserted
   5c | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   5d | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   5e | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   5f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   60 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   61 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   62 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   63 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   64 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   65 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   66 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   67 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   68 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
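
To pull just the watchdog events out of a log like this and see how far apart they are, something along these lines could be used. It is a rough sketch that assumes the pipe-separated layout shown above (record id, date, time, sensor, event) with MM/DD/YYYY dates. The function name is invented for the example, and records without a real timestamp ("Pre-Init Time-stamp") are simply skipped.

```python
from datetime import datetime


def watchdog_events(lines):
    """Return [(timestamp, event_text), ...] for timestamped Watchdog records."""
    events = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        # Timestamped records look like: id | date | time | sensor | event | ...
        if len(fields) < 5 or not fields[3].startswith("Watchdog"):
            continue
        try:
            when = datetime.strptime(fields[1] + " " + fields[2],
                                     "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # not a real date/time pair, skip the record
        events.append((when, fields[4]))
    return events


def gaps_seconds(events):
    """Seconds between consecutive events, in the order given."""
    times = [when for when, _ in events]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the SEL text split into lines gives the four Watchdog 2 records and the spacing between them, which can then be compared with when the power was cycled remotely.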


-- 
Kelsey Cummings - k...@corp.sonic.net      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407

_______________________________________________
Ipmitool-devel mailing list
Ipmitool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel
