The problem is solved, but first things first:

On 1/11/12 6:43 AM, Arnaud Quette wrote:
> 2012/1/9 William Seligman <selig...@nevis.columbia.edu>
>
>> On 1/9/12 9:53 AM, Arnaud Quette wrote:
>>
>>> 2012/1/6 William Seligman <selig...@nevis.columbia.edu>
>>>
>>>> I've googled and RTFM'ed, but still can't solve this one. I hope you
>>>> folks can.
>>>>
>>>> This affects my entire computer cluster, but let's start simple: I've
>>>> got a computer running NUT; OS is Scientific Linux 5.5; kernel
>>>> 2.6.18-274.12.1.el5xen. It connects to an APC SMART-UPS via an APC
>>>> SmartCard using the snmp-ups driver. It generally works: upsmon will
>>>> detect if the battery is low (I get an e-mail message); I can control
>>>> the UPS, inspect it variables, set variables, issue commands, and so
>>>> on.
>>>
>>> If "On battery" and "Low battery" are both detected, there should be no
>>> issue.
>>>
>>>> There's just one thing that does not happen: when the UPS goes critical,
>>>> the computer does not shut down. The upsmon daemon does not display any
>>>> messages, does not write to the syslog, does not send e-mail, etc.; even
>>>> though I've configured it to do so in upsmon.conf.
>>>>
>>>> I've tried nut-2.2.2, nut-2.4.3, and nut-2.6.2, and the symptom is the
>>>> same.
>>>
>>> Using the latest version, when possible, is always a good idea.
>>
>> Installing nut-2.6.2 on a Scientific Linux 5.5 system was a bit difficult,
>> and played havoc with my regular yum updates. After I've finished
>> debugging this problem, I'm going to completely reinstall the OS to make
>> sure I've got a consistent set of RPMs.
>
> you may have prefered to rebuild an SRPM like that:
> http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html

That's what I did, at first. The rebuild process for that RPM involves "-devel" libraries that are not part of an RHEL5-style distribution. So I tried to download and compile the SRPMs for those libraries (neon-devel, portman-devel, net-snmp-devel, etc.). This led to a chain of installs and the usual RPM hell; I had not appreciated how different RHEL6+ was from RHEL5.

Even with all the dependent libraries installed, the nut-2.6.2 SRPM would still not rebuild: although the neon and neon-devel libraries were present, the configure script couldn't find them, so the rebuild failed.

Finally, I did what I should have done from the start: I just used the nut-2.6.2.tar.gz file and built it manually. The configure script still couldn't find the neon libraries, but I didn't need that functionality for my tests, and this did not block the compilation. The only problem was getting the various directory options set so that files and binaries would be installed in the same directories as in a Red Hat distribution. Even then, I had to move binaries around post-install.

And after all that work, it still didn't solve the problem. Read on...

>>>> I tried issuing a "graceful reboot" command via the APC SmartCard's web
>>>> and telnet interface. It made no difference; the system still did not
>>>> shut down.
>>>>
>>>> Now let's extend the problem to my cluster: I have a variety of
>>>> different computers, all running Scientific Linux 5.5, connecting
>>>> through different switches, connecting to different flavors of APC
>>>> SMART-UPSes, via SmartCards, each ranging in age from six months to
>>>> five years. They all exhibit this same symptom, as I painfully
>>>> discovered during a recent power outage: they all sent me e-mail when
>>>> the UPSes went to low battery, but none turned off when the UPS went
>>>> critical. Given the range of hardware involved, this must be a common
>>>> software problem.
>>>>
>>>> The systems will shut down properly if I do "upsmon -c fsd", so it
>>>> doesn't appear to be a permissions problem.
>>>>
>>>> I don't think this is the upsdrv_shutdown() issue described in the
>>>> snmp-ups man page; I do not care if the UPS shuts down when the
>>>> computer does, nor do I want it to. I just want upsmon to shut down the
>>>> system when the UPS goes critical.
>>>>
>>>> Here are my config files; the system is tanya, its UPS is tanya-ups.
>>>> Any advice?
>>>>
>>>> ups.conf:
>>>>
>>>> [tanya-ups]
>>>> driver = snmp-ups
>>>> port = tanya-ups
>>>> community = private
>>>> mibs = apcc
>>>>
>>>> upsd.conf:
>>>>
>>>> # LISTEN 0.0.0.0 3493
>>>>
>>>> upsd.users:
>>>>
>>>> [admin]
>>>> password = nowayjose
>>>> actions = SET
>>>> instcmds = all
>>>> upsmon master
>>>>
>>>
>>> it's also a good idea to separate monitoring and administrative users.
>>> Ie:
>>> [admin]
>>> password = XXX
>>> actions = SET
>>> instcmds = all
>>>
>>> [monuser]
>>> password = XXX
>>> upsmon master
>>>
>>>> upsmon.conf:
>>>>
>>>> MONITOR tanya-ups@localhost 1 admin nowayjose master
>>>> MINSUPPLIES 1
>>>> SHUTDOWNCMD "/sbin/shutdown -h +0"
>>>> NOTIFYCMD /home/bin/notify.sh # sends me e-mail
>>>> POLLFREQ 5
>>>> POLLFREQALERT 5
>>>> HOSTSYNC 15
>>>> DEADTIME 15
>>>> POWERDOWNFLAG /etc/killpower
>>>> NOTIFYFLAG ONLINE SYSLOG
>>>> NOTIFYFLAG ONBATT SYSLOG+WALL
>>>> NOTIFYFLAG LOWBATT SYSLOG+WALL
>>>> NOTIFYFLAG FSD SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG COMMOK SYSLOG
>>>> NOTIFYFLAG COMMBAD SYSLOG
>>>> NOTIFYFLAG SHUTDOWN SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG NOCOMM SYSLOG
>>>> NOTIFYFLAG NOPARENT SYSLOG+WALL
>>>> RBWARNTIME 43200
>>>> NOCOMMWARNTIME 300
>>>> FINALDELAY 5
>>>
>>> Your config seems fine.
>>> An interesting test to do would be to stop upsmon, but keep snmp-ups and
>>> upsd, then discharge your UPS and to ensure that you indeed get an
>>> ups.status == "OB LB", which triggers the call to
>>> upsmon.conf->SHUTDOWNCMD.
>>> Note that you need both "OB" and "LB", since
>>> you may have "low battery" and be "online" at the same time!
>>
>> This is a good idea, and I ran the test. I disconnected the UPS, and
>> periodically checked the output of:
>>
>> upsc tanya-ups@localhost ups.status
>>
>> Eventually this command returned "OB LB" as you said. But upsmon did
>> nothing. I waited and eventually the UPS shut power to the system in a hard
>> crash.
>
> ooch, mea culpa!
> I was too brief in my answer, and forgot to tell you the obvious: remove
> your computer from the UPS, in order to avoid such crash.
>
>> So the UPS is sending the correct signals, and snmp-ups is reporting the
>> correct status. Is there anything else I can check to trace the cause of
>> the problem?
>
> indeed, though there is an issue, as you've reported initially.
>
> Could you do this test again, but this time:
> - remove your server from the UPS,
> - start upsmon in debug mode. If it's already started, just call "upsmon -c
> stop ; upsmon -DDDDD"
> and send us back the output, at least when it should see the "OB LB"
> condition, to see what's going on.

I solved the problem by looking at the code in upsmon.c. I did two stupid things:

- I didn't RTFM as much as I thought I had.
- In my rush to trim down the config files for my first message to nut-upsuser, I left out the crucial bits that would have enabled anyone else to help me.

Here's the key: In my upsmon.conf, I actually have two MONITOR lines:

MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master

(Note the change to "monuser", indicating that I followed Arnaud's advice.)

I'm using snmp-ups to communicate with my UPS. If the UPS that supplies power to the network switch goes critical, I want tanya to power down as well; after all, if tanya can't talk to its UPS anymore, it won't know when tanya-ups goes critical.

So the intent of the two MONITOR lines is: If either tanya-ups OR network-ups goes critical, shut down the system. But I also had this line in upsmon.conf:

MINSUPPLIES 1

With a power value of 1 on each MONITOR line, that tells upsmon the system can keep running as long as at least one of the two UPSes is still good. So the effect of the two MONITOR lines is: If tanya-ups AND network-ups go critical, shut down the system. Since all my tests involved just cutting the power via tanya-ups, upsmon wasn't shutting down tanya. It was doing what the configuration file told it to do.

The solution is to change the MINSUPPLIES line:

MINSUPPLIES 2

Then upsmon does what I want it to do. I've already confirmed this with direct tests. (I also discovered that I had to increase the "low-battery duration" parameter on tanya-ups, but that's another story.)

In general, at least for my cluster configuration, the argument to MINSUPPLIES should be equal to the number of MONITOR lines (each with a power value of 1) I have in upsmon.conf.

My confusion was due to my misinterpretation of the language of the documentation. The upsmon.conf man page and big-servers.txt both speak about power supplies directly connected to the system; I skipped over those parts because I thought of only one UPS supplying power to my system. In my configuration I have to think of the network switch as part of "the system." I should have paid more attention.

Thanks for trying to help me out, Arnaud. It wasn't your fault that I didn't give you enough information.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
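
For anyone who lands on this thread later, the MINSUPPLIES rule described above can be modelled roughly as follows. This is only a simplified sketch in Python, not the code from upsmon.c: the function names `is_critical` and `should_shut_down` are hypothetical, and it assumes a UPS counts as critical only when its status carries both "OB" and "LB" (the real upsmon considers other cases as well). Each MONITOR line is represented as a (status, power value) pair.

```python
def is_critical(status: str) -> bool:
    """Simplified assumption: critical means on battery AND low battery."""
    flags = status.split()
    return "OB" in flags and "LB" in flags


def should_shut_down(monitors, minsupplies: int) -> bool:
    """monitors: list of (ups.status string, power value) pairs,
    one per MONITOR line. Shut down when the power values of the
    non-critical UPSes add up to less than MINSUPPLIES."""
    good = sum(power for status, power in monitors if not is_critical(status))
    return good < minsupplies


# tanya-ups has gone critical, network-ups is still on line power.
monitors = [("OB LB", 1), ("OL", 1)]

print(should_shut_down(monitors, minsupplies=1))  # False: one good supply is enough
print(should_shut_down(monitors, minsupplies=2))  # True: both supplies are required
```

With two MONITOR lines of power value 1 each, MINSUPPLIES 1 only triggers once both UPSes are critical, while MINSUPPLIES 2 triggers as soon as either one is.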
_______________________________________________
Nut-upsuser mailing list
Nut-upsuser@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/nut-upsuser