This isn't exactly a Scientific Linux issue, but I hope that folks here may be more likely to be using IPMI than those on some of the other lists.

We have a series of Supermicro systems w/IPMI running RHEL 5.5. We're using IPMI primarily to monitor PSU status and for the hardware watchdog support teamed with the watchdog service. Three or four out of eight identical systems have exhibited hardware-watchdog-triggered resets for no apparent reason.

As best we can tell, despite the OS and hardware being perfectly healthy (no other errors, and the systems work fine after the watchdog is disabled), the hardware watchdog is triggering a reset on its own, and worse, the boxes do not appear to come back from it.

Has anyone else seen a similar issue, or have any input?

So far Supermicro has suggested we disable the hardware watchdog...

/etc/watchdog.conf contains the minimal config:
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
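
One thing that may be worth ruling out is whether the daemon's write interval leaves any margin below the hardware timer's timeout. Below is a minimal sketch of that check in Python, assuming the config uses plain `key = value` lines as shown above. The function names and the `hw_timeout_s` argument are hypothetical; the real timeout has to be read from the BMC or the watchdog driver settings and is not something shown in this post.

```python
def parse_watchdog_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def interval_leaves_margin(settings, hw_timeout_s):
    """Return True if the ping interval is strictly shorter than the timeout.

    hw_timeout_s is an assumed input supplied by the caller (seconds).
    """
    return int(settings["interval"]) < hw_timeout_s
```

If the interval turns out to equal the timeout, any scheduling delay in the daemon could let the timer expire even on a healthy system.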

The SEL follows. The Power Supply events are due to remotely resetting the power, and, of course, all of these systems are at remote POPs.

  3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
  3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
  3e | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
  3f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
  40 | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
  41 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
  42 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
  43 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
  44 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
  45 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
  46 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
  47 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
  48 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
  49 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
  4a | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
  4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
  4c | 07/02/2011 | 21:13:39 | Power Supply #0x14 | State Asserted
  4d | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
  4e | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
  4f | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
  50 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
  51 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
  52 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
  53 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
  54 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
  55 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
  56 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
  57 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
  58 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
  59 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
  5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
  5b | 07/02/2011 | 21:26:55 | Power Supply #0x14 | State Asserted
  5c | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
  5d | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
  5e | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
  5f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
  60 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
  61 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
  62 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
  63 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
  64 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
  65 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
  66 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
  67 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
  68 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
  69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
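
To see how closely the watchdog events follow one another, the log above can be reduced to just the Watchdog entries and the gaps between them. This is a rough sketch, assuming the pipe-separated layout shown here and that the dates are in MM/DD/YYYY order. The function names are made up for illustration.

```python
from datetime import datetime


def watchdog_events(sel_text):
    """Return (timestamp, description) for each timestamped Watchdog entry."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Expect: id | date | time | sensor | event | ...
        if len(fields) < 5 or not fields[3].startswith("Watchdog"):
            continue
        try:
            # Assumes MM/DD/YYYY dates; "Pre-Init Time-stamp" rows fail here.
            when = datetime.strptime(fields[1] + " " + fields[2],
                                     "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
        events.append((when, fields[4]))
    return events


def gaps(events):
    """Return the time differences between consecutive events."""
    times = [when for when, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding it the saved SEL text gives the list of watchdog events and the time between each pair, which makes it easier to compare against the manual power resets.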


--
Kelsey Cummings - [email protected]      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407
