Here's how the IPMI watchdog should work:
1) Configure it (action, timeout, etc.).
2) Start the watchdog timer.
3) Some application (ipmitool raw, ipmiutil wdt, etc.) or driver (openipmi) restarts the watchdog timer before the timeout, repeatedly.
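For illustration, the three steps above can be sketched with ipmitool raw commands. This is only a rough sketch, not a tested recipe for your boards. It assumes the standard IPMI 2.0 App commands (NetFn 0x06, Set Watchdog Timer 0x24, Reset Watchdog Timer 0x22) and the usual six-byte Set Watchdog Timer request. The function names, the 60 second timeout, and the 10 second interval are made up for the example.

```python
import subprocess
import time

# IPMI App NetFn and watchdog command numbers (assumed from the IPMI 2.0 spec).
NETFN_APP = 0x06
CMD_RESET_WATCHDOG = 0x22
CMD_SET_WATCHDOG = 0x24

TIMER_USE_SMS_OS = 0x04   # timer is owned by run-time (OS) software
ACTION_HARD_RESET = 0x01  # BMC action when the countdown reaches zero


def set_watchdog_cmd(timeout_s, action=ACTION_HARD_RESET):
    """Step 1: build the ipmitool argv that configures the timer."""
    ticks = int(timeout_s * 10)  # the countdown is expressed in 100 ms units
    if not 0 < ticks <= 0xFFFF:
        raise ValueError("timeout out of range")
    data = [
        TIMER_USE_SMS_OS,
        action,
        0x00,          # pre-timeout interval, unused here
        0x00,          # leave the expiration flags untouched
        ticks & 0xFF,  # initial countdown, low byte
        ticks >> 8,    # initial countdown, high byte
    ]
    raw = [NETFN_APP, CMD_SET_WATCHDOG] + data
    return ["ipmitool", "raw"] + [f"0x{b:02x}" for b in raw]


def reset_watchdog_cmd():
    """Steps 2 and 3: Reset Watchdog Timer starts or restarts the countdown."""
    return ["ipmitool", "raw", f"0x{NETFN_APP:02x}", f"0x{CMD_RESET_WATCHDOG:02x}"]


def run_watchdog(timeout_s=60, interval_s=10, run=subprocess.check_call,
                 sleep=time.sleep, beats=None):
    """Configure once, then keep restarting the timer (forever if beats is None)."""
    run(set_watchdog_cmd(timeout_s))
    n = 0
    while beats is None or n < beats:
        run(reset_watchdog_cmd())  # the first call starts the timer
        sleep(interval_s)
        n += 1
```

If the loop in run_watchdog() stops (the process dies, or the interval is longer than the timeout), the BMC performs the configured action, which is the "1 & 2 but not 3" case. The interval should be kept well below the timeout.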
The most common problem case is that 1 & 2 are done, but 3 is not, so the system resets every time. It could also be a Supermicro firmware bug, but filing a firmware bug report would require more detail about the test case. How are you performing these steps?

Andy

-----Original Message-----
From: Kelsey Cummings [mailto:k...@corp.sonic.net]
Sent: Wednesday, July 20, 2011 5:31 PM
To: ipmitool-devel@lists.sourceforge.net
Subject: [Ipmitool-devel] supermicro ipmi watchdog issues

We have a series of Supermicro systems w/IPMI running RHEL 5.5. We're using IPMI primarily to monitor PSU status and for the hardware watchdog support teamed with the watchdog service.

3 or 4 out of 8 identical systems have exhibited hardware watchdog triggered resets for no apparent reason. Best we can tell, despite the OS and hardware being perfectly healthy (no other errors, and the systems work fine after the watchdog is disabled), the hardware watchdog is triggering a reset on its own, and worse, the boxes do not appear to come back from it.

Has anyone else seen a similar issue or have any input? So far Supermicro has suggested we disable the hardware watchdog...

/etc/watchdog.conf contains the minimal config:

interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog

In the SEL below, the Power Supply events are due to remotely resetting the power, and, of course, all of these systems are at remote POPs.

  3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
  3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
  3e | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
  3f | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
  40 | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
  41 | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
  42 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
  43 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
  44 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
  45 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
  46 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
  47 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
  48 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
  49 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
  4a | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
  4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
  4c | 07/02/2011 | 21:13:39 | Power Supply #0x14 | State Asserted
  4d | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
  4e | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
  4f | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
  50 | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
  51 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
  52 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
  53 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
  54 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
  55 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
  56 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
  57 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
  58 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
  59 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
  5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
  5b | 07/02/2011 | 21:26:55 | Power Supply #0x14 | State Asserted
  5c | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
  5d | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
  5e | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
  5f | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
  60 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
  61 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
  62 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
  63 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
  64 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
  65 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
  66 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
  67 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
  68 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
  69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted

--
Kelsey Cummings - k...@corp.sonic.net      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407
_______________________________________________
Ipmitool-devel mailing list
Ipmitool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel