I have a great relationship with some SuperMicro engineers; if others can
provide part #'s and firmware/BIOS revs, I can bring this up with them.
________________________________________
From: owner-m...@openbsd.org <owner-m...@openbsd.org> on behalf of li...@wrant.com <li...@wrant.com>
Sent: Wednesday, October 21, 2015 8:50 PM
To: misc@openbsd.org
Subject: Re: requesting help working around boot failures with supermicro atom board

Synopsis: if the sensors show missing data, reset the BMC unit before
rebooting the system to prevent the "unable to boot, long beep" issue.

I found a reliably reproducible workaround for this problem that retains
control continuity without the need to trip the mains breaker.  This
entirely prevents the long beep issue and allows the system to be used
in headless remote environments without requiring remote mains power
cycle capability and/or remote hands intervention.

I have not had to disable the lm(4) sensor, as advised previously, for
the workaround, and I reached the conclusion that this problem is not
caused by the driver itself in the first place, but by buggy BMC
firmware.  It is therefore advisable to contact Supermicro technical
support again and ask them for a reliable BMC firmware update which does
not manifest the problem.

After the system has been running for a longer period (not specific or
deterministic, but above 30 min), the sensors start to display wrong
(missing) values and cannot provide data points to the BMC firmware.
This is seen both in IPMI direct and networked access and in the web
based management interface.  At this point, a reboot leaves the system
unable to boot, manifesting the dreaded long beep.  Only a power cycle
of the mains (power supply breaker or power distribution unit) for a
couple of seconds unblocks the system so that it can boot successfully
again.  This, however, totally undermines the remote control
capabilities of the system, effectively turning it into a continuous
source of requests for manual intervention to power cycle the mains
(stop and start).

The workaround for this is to reset the BMC before attempting to reboot
the system.  It works directly over the network via IPMI and also via
the web based BMC interface.  This only reboots the IPMI controller and
its embedded firmware (not the system); after a couple of minutes the
sensors poll actual correct data and display it properly.  At this point
a system reboot succeeds as expected and the system boots up and works
properly, until some non-specific longer time passes again (from 1 h to
days) and the BMC controller gets stuck again (it is certain to get
stuck), which is indicated by missing sensor data and no reboot
capability, with the long beep indication.
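
The sequence described above (check the sensors, cold-reset the BMC,
wait for readings to return, only then reboot) can be sketched in
Python.  This is a minimal sketch under stated assumptions, not a tested
procedure: `ipmitool mc reset cold` is assumed to be the command behind
the cold reset, and the function name, its parameters, the timeout and
the polling interval are all hypothetical.

```python
import time


def reset_bmc_then_reboot(run, read_sensors, is_stuck,
                          timeout=300.0, interval=15.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Reset the BMC if needed, wait for sensors to recover, then reboot.

    run          -- callable that executes an argv list (hypothetical hook)
    read_sensors -- callable returning the sensor listing as text; it
                    should return '' if the BMC does not answer
    is_stuck     -- callable deciding whether a listing shows the stuck state
    Returns True if the reboot was issued, False if the BMC did not recover.
    """
    if is_stuck(read_sensors()):
        # Assumed command: cold-resets only the management controller,
        # the host system keeps running.
        run(["ipmitool", "mc", "reset", "cold"])
        deadline = clock() + timeout
        while True:
            sleep(interval)
            if not is_stuck(read_sensors()):
                break
            if clock() >= deadline:
                return False  # still stuck: do not reboot into the long beep
    run(["shutdown", "-r", "now"])
    return True
```

Passing the command runner and the sensor reader in as callables keeps
the sketch independent of how IPMI is reached (local interface, network,
or a wrapper script); in real use `run` could simply wrap
`subprocess.run(argv, check=True)`.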

This is NOT OS specific, unless the driver polling the sensors causes
the sensors sub-system in the embedded controller OS to crash; the only
factor found to affect it so far is the time the system has been running
without a mains power cycle.  It is a flaw of the BMC firmware, for
which the solution is surely to demand an updated firmware without this
fault from Supermicro.  It would help if more people voice their
concerns over this, so that an updated BMC firmware is issued by
Supermicro technical support and published on their web site.

Here is how it looks when the BMC is stuck:

$ ipmi-sensor
System Temp      | no reading        | ns
CPU Temp         | no reading        | ns
CPU FAN          | no reading        | ns
SYS FAN          | no reading        | ns
CPU Vcore        | no reading        | ns
Vichcore         | no reading        | ns
+3.3VCC          | no reading        | ns
VDIMM            | no reading        | ns
+5 V             | no reading        | ns
+12 V            | no reading        | ns
+3.3VSB          | no reading        | ns
VBAT             | no reading        | ns
Chassis Intru    | no reading        | ns
PS Status        | 0x00              | ok

$ ipmi-sensor-detail
System Temp      | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Temp         | na         |            | na    | na        | na        | na        | na        | na        | na
CPU FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
SYS FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Vcore        | na         |            | na    | na        | na        | na        | na        | na        | na
Vichcore         | na         |            | na    | na        | na        | na        | na        | na        | na
+3.3VCC          | na         |            | na    | na        | na        | na        | na        | na        | na
VDIMM            | na         |            | na    | na        | na        | na        | na        | na        | na
+5 V             | na         |            | na    | na        | na        | na        | na        | na        | na
+12 V            | na         |            | na    | na        | na        | na        | na        | na        | na
+3.3VSB          | na         |            | na    | na        | na        | na        | na        | na        | na
VBAT             | na         |            | na    | na        | na        | na        | na        | na        | na
Chassis Intru    | na         | discrete   | na    | na        | na        | na        | na        | na        | na
PS Status        | 0x0        | discrete   | 0x00ff| na        | na        | na        | na        | na        | na
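
For monitoring purposes, the stuck state shown in the listings above can
be recognised mechanically.  The following is a minimal Python sketch,
assuming the three-column "name | reading | status" format printed by
ipmi-sensor in this message; the function names and the 50% threshold
are illustrative assumptions and not part of any tool.

```python
def parse_sensor_listing(text):
    """Parse 'name | reading | status' lines into (name, reading, status)."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not parts[0]:
            continue  # skip prompts, blank lines and anything malformed
        rows.append(tuple(parts))
    return rows


def bmc_looks_stuck(text, threshold=0.5):
    """Return True when more than `threshold` of the sensors report no data.

    A fraction is used rather than "any sensor missing" because the
    healthy listing later in this message still shows two fan sensors
    with 'no reading'.  The 0.5 default is an arbitrary illustrative value.
    """
    rows = parse_sensor_listing(text)
    if not rows:
        return True  # no usable output at all is treated as a failure
    missing = sum(1 for _, reading, status in rows
                  if reading == "no reading" or status == "ns")
    return missing / len(rows) > threshold
```

Fed the stuck listing above (13 of 14 sensors without data), the check
reports a stuck BMC; the listing taken after the reset, with only the
two fan sensors missing, stays below the threshold.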

Here is how it looks after BMC reset:

$ ipmi-reset
Sent cold reset command to MC

~75 seconds later:

$ ipmi-sensor
System Temp      | 38 degrees C      | ok
CPU Temp         | 38 degrees C      | ok
CPU FAN          | no reading        | ns
SYS FAN          | no reading        | ns
CPU Vcore        | 1.10 Volts        | ok
Vichcore         | 1.04 Volts        | ok
+3.3VCC          | 3.31 Volts        | ok
VDIMM            | 1.53 Volts        | ok
+5 V             | 5.09 Volts        | ok
+12 V            | 12.03 Volts       | ok
+3.3VSB          | 3.28 Volts        | ok
VBAT             | 3.12 Volts        | ok
Chassis Intru    | 0x00              | ok
PS Status        | 0x00              | ok

$ ipmi-sensor-detail
System Temp      | 38.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 75.000    | 77.000    | 79.000
CPU Temp         | 38.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 85.000    | 90.000    | 95.000
CPU FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
SYS FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Vcore        | 1.096      | Volts      | ok    | 0.640     | 0.664     | 0.688     | 1.344     | 1.408     | 1.472
Vichcore         | 1.040      | Volts      | ok    | 0.808     | 0.824     | 0.840     | 1.160     | 1.176     | 1.192
+3.3VCC          | 3.312      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712
VDIMM            | 1.528      | Volts      | ok    | 1.312     | 1.328     | 1.344     | 1.648     | 1.664     | 1.680
+5 V             | 5.088      | Volts      | ok    | 4.096     | 4.320     | 4.576     | 5.344     | 5.600     | 5.632
+12 V            | 12.031     | Volts      | ok    | 10.706    | 10.600    | 10.494    | 13.091    | 13.197    | 13.303
+3.3VSB          | 3.280      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712
VBAT             | 3.120      | Volts      | ok    | 2.560     | 2.624     | 2.688     | 3.328     | 3.392     | 3.456
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
PS Status        | 0x0        | discrete   | 0x00ff| na        | na        | na        | na        | na        | na
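
Once the BMC answers again, the detailed rows can also be checked
against their own limits.  The sketch below is in Python and rests on an
assumption the listing itself does not state: that the three columns
after the status are lower thresholds and the last three are upper
thresholds.  The function names are hypothetical.

```python
def parse_detail_row(line):
    """Split one 10-column detail row; 'na', blanks and hex become None."""
    cols = [c.strip() for c in line.split("|")]
    if len(cols) != 10:
        raise ValueError("expected 10 columns, got %d" % len(cols))

    def num(s):
        try:
            return float(s)
        except ValueError:
            return None  # 'na', '', or a discrete value such as 0x00ff

    return {
        "name": cols[0],
        "value": num(cols[1]),
        "unit": cols[2],
        "status": cols[3],
        "lower": [num(c) for c in cols[4:7]],   # assumed lower thresholds
        "upper": [num(c) for c in cols[7:10]],  # assumed upper thresholds
    }


def within_thresholds(row):
    """True if the reading lies strictly inside every numeric threshold."""
    value = row["value"]
    if value is None:
        return False  # no reading to compare
    lower = [x for x in row["lower"] if x is not None]
    upper = [x for x in row["upper"] if x is not None]
    return all(value > x for x in lower) and all(value < x for x in upper)
```

Comparing against all three limits on each side, rather than a specific
one, avoids depending on the order in which the firmware lists them (the
+12 V row above, for example, lists its lower limits in descending
order).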

The main board with this specific workaround applicable is:
MBD-X7SPA-HF-D525-O

The main board was bought brand new in May 2011, in original packing,
from an official retailer carrying Supermicro products, and uses memory
modules from the qualified vendor list.

http://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA-HF-D525.cfm

The BMC and BIOS firmwares are the latest available from the Supermicro
web site:

Firmware Revision: 03.16
Firmware Build Time: 2014-06-30

Supermicro X7SPA/X7SPE/X7SPT Series BIOS Date:07/19/13 BIOS Rev:1.2b

Hopefully this helps in further diagnostics and, in the meantime, serves
as a workaround that allows people with boards having the same problem
to operate them remotely until a BMC firmware fixing the issue is
available.

Regards,
Anton
