Hey there all,

We’ve got several SuperMicro servers at work of varying vintages.  This is a 
fairly large vendor that does a LOT of OEM work, and they’ve been sort of slow 
to adopt things like SNMP for monitoring (several of my older boards don’t 
support SNMP at all).  

In trying to get monitoring of the hardware going via IPMI, rather than 
reinventing the wheel and parsing ipmitool output, I stumbled across this tool: 
https://github.com/thomas-krenn/check_ipmi_sensor_v3 -- which uses freeipmi 
instead of ipmitool.  That has made me more curious about the differences 
between the tools.  

Here are the issues I’m seeing, mainly around PSUs and physical security:

In a system that has two PSU slots, a slot is listed as ‘OK’ (versus ‘Presence 
detected’) if no PSU is installed in that slot.  (To be clear, the BMC also 
seems to see this as a non-issue: it doesn’t return critical in the WebUI and 
doesn’t turn on the red light on the chassis, so it *might* be the plugin’s 
job to parse this data and flag it.)  Still, I would like to assume I put PSUs 
everywhere I need them, and treat ‘OK’ as an error.  Since freeipmi gives me 
config files where I can set this, it just seems to be a matter of figuring 
out how to do so, right?

Output with PSU2 removed:
1947 | PS1 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 'Presence detected'
2483 | PS2 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
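
If this does end up being the plugin's job, the check I have in mind would look roughly like the Python sketch below. It is not part of freeipmi or check_ipmi_sensor; the function names (parse_sensor_line, site_state) are made up for illustration, and the "every PSU slot must be populated" rule is purely my site policy, applied to the two rows above after unwrapping them.

```python
import re


def parse_sensor_line(line):
    """Split one pipe-delimited ipmi-sensors row into its useful parts."""
    fields = [field.strip() for field in line.split("|")]
    name, sensor_type, state = fields[1], fields[2], fields[3]
    # The last column holds one or more single-quoted event strings.
    events = re.findall(r"'([^']*)'", fields[-1])
    return name, sensor_type, state, events


def site_state(sensor_type, state, events):
    """Local policy (an assumption, not FreeIPMI behaviour): a Power Supply
    sensor that does not report 'Presence detected' counts as Critical."""
    if sensor_type == "Power Supply" and "Presence detected" not in events:
        return "Critical"
    return state


# The two rows from the output above, unwrapped onto single lines.
ROWS = [
    "1947 | PS1 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A "
    "| N/A | N/A | N/A | N/A | 'Presence detected'",
    "2483 | PS2 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A "
    "| N/A | N/A | N/A | N/A | 'OK'",
]

for row in ROWS:
    name, sensor_type, state, events = parse_sensor_line(row)
    print(name, events, "->", site_state(sensor_type, state, events))
```

Doing this outside freeipmi works, but it hard-codes vendor knowledge into the wrapper, which is why I would rather express it in the interpret config.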

If a power supply is faulted in some other way (no AC power, for example), 
ipmi-sensors still lists the state as “Nominal”, even though the event column 
shows the failure:

ID   | Name       | Type         | State    | Reading    | Units | Lower NR   | Lower C    | Lower NC   | Upper NC   | Upper C    | Upper NR   | Event
1947 | PS1 Status | Power Supply | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'Presence detected'
2483 | PS2 Status | Power Supply | Nominal  | N/A        | N/A   | N/A        | N/A        | N/A        | N/A        | N/A        | N/A        | 'Presence detected' 'Power Supply Failure detected' 'Power Supply input lost (AC/DC)'

Given that freeipmi_interpret_sensor.conf *has* entries for this, I lost an 
hour or two trying to make them say something other than nominal, with no luck:

## IPMI_Power_Supply
#
# IPMI_Power_Supply_No_Event                                    Nominal
# IPMI_Power_Supply_Presence_Detected                           Nominal
# IPMI_Power_Supply_Power_Supply_Failure_Detected               Critical
# IPMI_Power_Supply_Predictive_Failure                          Critical
# IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC               Critical
# IPMI_Power_Supply_Power_Supply_Input_Lost_Or_Out_Of_Range     Critical
# IPMI_Power_Supply_Power_Supply_Input_Out_Of_Range_But_Present Critical
# IPMI_Power_Supply_Configuration_Error                         Critical
# IPMI_Power_Supply_Power_Supply_Inactive                       Warning

Just as a sanity check, I gathered ipmitool data as well:

ipmitool output from a healthy system:
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na

A system that has had its cover lifted and PS2 unplugged:
Chassis Intru    | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
PS2 Status       | 0xb        | discrete   | 0x0b00| na        | na        | na        | na        | na        | na

(I want to say that a power supply reading “OK” showed up as 0x0000, but I 
don’t see it in my scrollback.)
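
For reference, this is how I am reading those hex values, as a small Python sketch. It assumes that the first byte of the four-digit field (the "0b" in 0x0b00) carries the low eight event-state bits, and that the bit offsets follow the Power Supply sensor type in the IPMI spec, in the same order as the freeipmi_interpret_sensor.conf entries quoted earlier. The function name decode_power_supply is made up.

```python
# Power Supply sensor-specific offsets, bit 0 first. The order is assumed to
# match the IPMI spec and the config entries quoted earlier in this post.
POWER_SUPPLY_OFFSETS = [
    "Presence detected",
    "Power Supply Failure detected",
    "Predictive Failure",
    "Power Supply input lost (AC/DC)",
    "Power Supply input lost or out-of-range",
    "Power Supply input out-of-range, but present",
    "Configuration error",
    "Power Supply Inactive",
]


def decode_power_supply(status_hex):
    """Turn an ipmitool discrete status such as '0x0b00' into event names."""
    value = int(status_hex, 16)
    low_bits = (value >> 8) & 0xFF  # assumed: first printed byte = offsets 0-7
    return [
        name
        for bit, name in enumerate(POWER_SUPPLY_OFFSETS)
        if low_bits & (1 << bit)
    ]


for status in ("0x0100", "0x0b00", "0x0000"):
    print(status, decode_power_supply(status))
```

Read that way, 0x0b00 has bits 0, 1 and 3 set, which lines up with the three events ipmi-sensors printed for PS2, and 0x0000 would mean no event bits at all, which fits an empty slot showing up as 'OK'.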

1) What I think is going on here is that the power supply *has* a reading, but 
for some reason ipmi-sensors isn’t reading it (at least not according to -vv), 
and thus cannot interpret a result code that yields one of the above.  An 
option (like -f or --force) to force it to gather a reading would be useful 
here, perhaps?  I thought that’s what '-W discretereading' was for, but it 
doesn’t seem to cause a change.

Better question: how can “state” be nominal if there’s no reading? :)  Should I 
trust *any* sensor which reads nominal?

2) In working with the above, I’ve tried setting --interpret-oem-data, but I 
cannot tell if that means “parse all the weird OEM data you possibly know 
about” (akin to telling snmpwalk to load ALL the MIBs), or if it is more a case 
of “poll the board and then load data for that specific vendor”.  The -vv 
output doesn’t tell you which data it’s parsing/using, if any at all.  (Perhaps 
how this is used could be clarified?)

3) I’ve tried to construct IPMI_OEM_Bitmask and IPMI_OEM_Values for the above 
items, but I’m not seeing the vendor or manufacturer ID in the output of -vv, 
and it’s not clear to me if the “Events” above are text strings read from the 
BMC, or if they’re interpreted from a raw value that I could choose to parse 
instead.  

I’m loath to start throwing ipmi-raw commands at my devices, especially 
considering they’ve proven a little bit weird about following the spec, and 
I’ve stopped short of following a tcpdump of the conversation.  Knowing the 
best way forward to construct these lines would be helpful.  Documenting how 
to *get* the vendor string would help too.  (For example, for *this* specific 
vendor, I would like Power Supply “ok” to be a critical error, because that’s 
what the BMC returns when it’s missing, whereas on a FooCorp machine, OK may be 
perfectly valid.)

Additionally, being able to give the tools the equivalent of netstat’s “-n” 
(don’t interpret, only show raw values) would be helpful, to see how we get 
from some hex code (that I can’t get the tools to show), to 
IPMI_Power_Supply_Presence_Detected, to ‘Presence detected’.  (I also thought 
*this* was what -W discretereading would do, but this is not the case either.)

-Dan

(Note: freeipmi_interpret_sel.conf does yield critical on some failures, as the 
PSUs and chassis intrusion read as critical, but I still can’t figure out how 
to make “ok” be a critical state.  I think this is “Sensors Issue” 13 in 
https://github.com/elitak/freeipmi/blob/master/doc/freeipmi-bugs-issues-and-workarounds.txt.
Weirdly, nothing shows up in the SEL when a PSU is removed or inserted!  So I 
don’t think there’s an event we can look for here, but I’d be happy to be 
proven wrong.  The log below is from the system where I had the datacenter 
staff pull the PSU, reinsert it, and then disconnect the cord.  So I think this 
treat-ok-as-critical reading *has* to come from the interpretation of 
ipmi-sensors, rather than ipmi-sel.)

/usr/local/sbin/ipmi-sel --output-event-state
ID   | Date        | Time     | Name             | Type              | State    | Event
1    | Nov-25-2024 | 21:47:47 | Chassis Intru    | Physical Security | Critical | General Chassis Intrusion
2    | Dec-09-2024 | 20:45:38 | Chassis Intru    | Physical Security | Critical | General Chassis Intrusion
3    | Dec-09-2024 | 21:12:29 | PS2 Status       | Power Supply      | Critical | Power Supply Failure detected


_______________________________________________
Freeipmi-users mailing list
Freeipmi-users@gnu.org
https://lists.gnu.org/mailman/listinfo/freeipmi-users
