Hey there all,

We've got several SuperMicro servers at work of varying vintages. SuperMicro is a fairly large vendor that does a LOT of OEM work, and they've been somewhat slow to adopt things like SNMP for monitoring (several of my older boards don't support SNMP at all).

In trying to get hardware monitoring going via IPMI, rather than reinventing the wheel and parsing ipmitool output, I stumbled across this tool:

    https://github.com/thomas-krenn/check_ipmi_sensor_v3

It uses freeipmi instead of ipmitool, which has made me curious about the differences between the tools.

Here are the issues I'm seeing, mainly around PSUs and physical security.

In a system that has two PSU slots, a slot is listed as 'OK' (versus 'Presence detected') if no PSU is installed in it. (To be clear, the BMC also seems to see this as a non-issue: it doesn't return critical in the WebUI and doesn't turn on the red light on the chassis. So it *might* be the job of the plugin parsing this data to flag it.) But I would like to assume I put PSUs everywhere I need them, and treat 'OK' as an error. Since freeipmi gives me config files where I can set this, it just seems to be a matter of figuring out how to do so, right?

Output with PSU2 removed:

    1947 | PS1 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 'Presence detected'
    2483 | PS2 Status | Power Supply | Nominal | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'

If a power supply is faulted in some other way (no AC power, for example), ipmi-sensors still lists the state as "Nominal":

    ID   | Name       | Type         | State   | Reading | Units | Lower NR | Lower C | Lower NC | Upper NC | Upper C | Upper NR | Event
    1947 | PS1 Status | Power Supply | Nominal | N/A     | N/A   | N/A      | N/A     | N/A      | N/A      | N/A     | N/A      | 'Presence detected'
    2483 | PS2 Status | Power Supply | Nominal | N/A     | N/A   | N/A      | N/A     | N/A      | N/A      | N/A     | N/A      | 'Presence detected' 'Power Supply Failure detected' 'Power Supply input lost (AC/DC)'

Given that freeipmi_interpret_sensor.conf *has* entries for this, I lost an hour or two trying to make them say something other than nominal, with no luck:

    ## IPMI_Power_Supply
    #
    # IPMI_Power_Supply_No_Event                                     Nominal
    # IPMI_Power_Supply_Presence_Detected                            Nominal
    # IPMI_Power_Supply_Power_Supply_Failure_Detected                Critical
    # IPMI_Power_Supply_Predictive_Failure                           Critical
    # IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC                Critical
    # IPMI_Power_Supply_Power_Supply_Input_Lost_Or_Out_Of_Range      Critical
    # IPMI_Power_Supply_Power_Supply_Input_Out_Of_Range_But_Present  Critical
    # IPMI_Power_Supply_Configuration_Error                          Critical
    # IPMI_Power_Supply_Power_Supply_Inactive                        Warning

Just as a sanity check, I gathered ipmitool data as well.

ipmitool output from a healthy system:

    Chassis Intru | 0x0 | discrete | 0x0000| na | na | na | na | na | na
    PS1 Status    | 0x1 | discrete | 0x0100| na | na | na | na | na | na
    PS2 Status    | 0x1 | discrete | 0x0100| na | na | na | na | na | na

A system that has had its cover lifted and PS2 unplugged:

    Chassis Intru | 0x1 | discrete | 0x0100| na | na | na | na | na | na
    PS1 Status    | 0x1 | discrete | 0x0100| na | na | na | na | na | na
    PS2 Status    | 0xb | discrete | 0x0b00| na | na | na | na | na | na

(I want to say that the reading was 0x0000 for an "OK" power supply, but I don't see it in my scrollback.)

1) What I think is going on here is that the power supply *has* a reading, but for some reason ipmi-sensors isn't reading it (at least not according to -vv), and thus cannot interpret a result code that yields one of the above.
An option (like -f or --force) to force it to gather a reading would be useful here, perhaps? I thought that's what '-W discretereading' was for, but it doesn't seem to cause a change. Better question: how can "State" be Nominal if there's no reading? :) Should I trust *any* sensor which reads Nominal?

2) In working with the above, I've tried setting --interpret-oem-data, but I cannot tell if that means "parse all the weird OEM data you possibly know about" (akin to telling snmpwalk to load ALL the MIBs), or if it is more a case of "poll the board and then load data for that specific vendor". The -vv output doesn't tell you which data it's parsing or using, if any at all. (Perhaps how this is used could be clarified?)

3) I've tried to construct IPMI_OEM_Bitmask and IPMI_OEM_Values entries for the above items, but I'm not seeing the vendor or manufacturer ID in the output of -vv, and it's not clear to me whether the "Events" above are text strings read from the BMC, or whether they're interpreted and there's a raw value I could choose to parse. I'm loath to start throwing ipmi-raw commands at my devices, especially considering they've proven a little bit weird about following the spec, and I've stopped short of following a tcpdump of the conversation. Knowing the best way forward to construct these lines would be helpful. Documenting how to *get* the vendor string would help? (For example, for *this* specific vendor, I would like Power Supply "OK" to be a critical error, because that's what the BMC returns when the supply is missing, whereas on a FooCorp machine, OK may be perfectly valid.)

Additionally, being able to give the tools the equivalent of netstat's "-n" (don't interpret, only show raw values) would be helpful. It would show how we get from some hex code (that I can't get the tools to show), to IPMI_Power_Supply_Presence_Detected, to 'Presence detected'. (I also thought *this* was what -W discretereading would do, but that is not the case either.)
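
As a rough sketch of that hex-to-name path (not how freeipmi does it internally), the ipmitool readings above do seem to line up with the events ipmi-sensors prints if you treat the first byte after "0x" as a bitmask. Two assumptions here: that the state bits follow the order of the IPMI_Power_Supply entries in the config listing above (Presence_Detected as bit 0, and so on), and that the first printed byte holds bits 0..7. The 0x0000 case rests on my unconfirmed recollection of what an empty slot reads as. The function names and the "site policy" are made up for illustration.

```python
# Event names in the order of the IPMI_Power_Supply_* config entries quoted
# above. ASSUMPTION: this order corresponds to state bits 0..7.
POWER_SUPPLY_EVENTS = [
    "Presence detected",
    "Power Supply Failure detected",
    "Predictive Failure",
    "Power Supply input lost (AC/DC)",
    "Power Supply input lost or out-of-range",
    "Power Supply input out-of-range, but present",
    "Configuration error",
    "Power Supply Inactive",
]


def decode_power_supply(reading):
    """Turn an ipmitool discrete reading such as '0x0b00' into event names."""
    value = int(reading, 16)
    # ASSUMPTION: the first printed byte carries state bits 0..7.
    state_byte = (value >> 8) & 0xFF
    return [
        name
        for bit, name in enumerate(POWER_SUPPLY_EVENTS)
        if state_byte & (1 << bit)
    ]


def site_state(events):
    """Hypothetical site policy: every PSU slot is expected to be populated."""
    if "Presence detected" not in events:
        # Empty slot: the case ipmi-sensors shows as 'OK' / Nominal.
        return "Critical"
    extra = [e for e in events if e != "Presence detected"]
    if not extra:
        return "Nominal"
    # Mirror the config defaults: Inactive alone is a warning, the rest critical.
    if extra == ["Power Supply Inactive"]:
        return "Warning"
    return "Critical"


if __name__ == "__main__":
    for reading in ("0x0100", "0x0b00", "0x0000"):
        events = decode_power_supply(reading)
        print(reading, site_state(events), events)
```

With the readings from this thread, 0x0100 decodes to presence only, and 0x0b00 decodes to the same three events ipmi-sensors showed for the unplugged PS2, which is what makes me think the mapping is at least plausible.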

-Dan

(Note: freeipmi_interpret_sel.conf does yield critical on some failures, as the PSU and chassis intrusion events read as critical, but I still can't figure out how to make "OK" be a critical state. I think this is "Sensors Issue" 13 in https://github.com/elitak/freeipmi/blob/master/doc/freeipmi-bugs-issues-and-workarounds.txt. Weirdly, nothing shows up in the SEL when a PSU is removed or inserted! So I don't think there's an event we can look for here, but I'd be happy to be proven wrong. The log below is from the system where I had the datacenter staff pull the PSU, reinsert it, and then disconnect the cord, so I think this treat-OK-as-critical reading *has* to come from the interpretation done by ipmi-sensors, rather than ipmi-sel.)

    /usr/local/sbin/ipmi-sel --output-event-state
    ID | Date        | Time     | Name          | Type              | State    | Event
    1  | Nov-25-2024 | 21:47:47 | Chassis Intru | Physical Security | Critical | General Chassis Intrusion
    2  | Dec-09-2024 | 20:45:38 | Chassis Intru | Physical Security | Critical | General Chassis Intrusion
    3  | Dec-09-2024 | 21:12:29 | PS2 Status    | Power Supply      | Critical | Power Supply Failure detected

_______________________________________________
Freeipmi-users mailing list
Freeipmi-users@gnu.org
https://lists.gnu.org/mailman/listinfo/freeipmi-users