1) syslog errors every 20+ minutes or so like : Aug 7 10:21:03 ben usbhid-ups[3321]: libusb_get_string: error sending control message: Broken pipe

Not a cause of concern. It is a way of telling that the UPS is currently not able to handle a command. Most likely this is due to the UPS doing some internal housekeeping functions and the little microcontroller inside is not able to handle a command. We will probably suppress this in future NUT versions, as it is a common cause of false alarms.

2) syslog errors on a similar timescale like : Aug 7 08:17:40 ben kernel: [40170.402789] usb 2-1.2: usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups rqt 161 rq 1 len 8 ret -110

Same here. The kernel is informing you that the UPS didn't respond to a command (110 = ETIMEDOUT). The cause is most likely the same as the above and not a cause of concern either. Unlike the above message, there is nothing we can do about this as it is logged by the kernel.
Good to know.  Thanks for the reply.
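
To see how often these two harmless messages actually turn up, something like the following could be used. It is a rough sketch: count_ups_usb_errors is a made-up helper name, not something NUT provides, and the two patterns are simply taken from the syslog lines quoted above.

```shell
#!/bin/sh
# Sketch only: count syslog lines matching the two messages discussed above.
# count_ups_usb_errors is a hypothetical helper name, not part of NUT.

count_ups_usb_errors() {
    logfile=$1
    # grep -c prints the number of matching lines. -e may be repeated,
    # so a line matching either pattern is counted once.
    grep -c \
        -e 'libusb_get_string: error sending control message' \
        -e 'USBDEVFS_CONTROL failed cmd usbhid-ups' \
        "$logfile"
}
```

Called as count_ups_usb_errors /var/log/syslog (the log path is an assumption and varies between distributions), it prints a single number.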

3) The machine spontaneously shut down this morning due to a "low battery" condition. However, when I noticed 80 minutes later, the UPS battery was at 100%. I don't think it can charge that fast, so I think this must have been a communication error.

I'm not so sure about that. Don't overestimate the accuracy of the battery charge gauge on the UPS. It could be that it is just voltage based, which means that it will indicate full charge long before the battery is actually full. It could also mean that the battery is bad. This may cause nearly instant shutdowns when the mains fails (when the battery is under load) while it looks like the battery is (almost) full with the mains present (and the battery is not under load). Running a battery test usually reveals what is going on.

Best regards, Arjen
Fair points, but I think the battery is good. I've run several on-battery shutdowns lasting 90s+ by flipping the breaker (using upssched to initiate shutdown after 60s) and that works fine. I ran a longer test for 2 or 3 minutes once and watched the UPS's displayed estimated run time count down from 76 minutes as you'd expect it to. The UPS is brand new. Also, I suspect there was no power cut - in the past I've had to reset my stove clock after a power cut, and I don't recall having to do that this time. I have my system set up to shut down after 5 mins on battery, rather than wait for a lowbatt condition, so I doubt the low batt could have been reached due to a power cut... unless perhaps it was a night of successive 4 minute power cuts or, given the stove, 4 minute low-voltage conditions. I guess we'll never know for sure. In any case, this was enough for me to abandon 2.4.3 and try Cyberpower's own offering. That suffered from curious delays of its own, which I wasn't happy with given that its eventual power-off is timer based rather than signal based as in nut, and so it was back to nut 2.2.2...

So, I went back to nut 2.2.2 under Debian Lenny with both MAXAGE and DEADTIME set to 150s. This worked OK for 10 days, with the odd type (2) error from above, and the odd stale data error [aside : it is my understanding that data must now be stale for 150s for upsmon to log a stale data warning to syslog, since upsd doesn't pass on the stale data condition until MAXAGE is reached. So for 30 lots of 5s polls the data is stale... then it shows up in syslog, and, and this is what's weird, in almost every case it resolves itself 2s later.... just like it did when MAXAGE was 15s.] After 10 days it went into a stale data condition that continued all night.... until I stopped it by restarting nut in the morning.
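
The arithmetic in the aside above can be written out as a small sketch. polls_until_stale is a hypothetical helper name, and the 5s poll interval is simply the figure quoted in the text.

```shell
#!/bin/sh
# Sketch: how many polls fit into the staleness window described above.
# polls_until_stale is a hypothetical helper; both arguments are in seconds.

polls_until_stale() {
    maxage=$1      # MAXAGE from upsd.conf (150 in the setup above)
    interval=$2    # poll interval (5s in the setup above)
    echo $(( maxage / interval ))
}

polls_until_stale 150 5
polls_until_stale 15 5
```

With these values it prints 30 and then 3: thirty consecutive stale polls before the warning with MAXAGE at 150s, against three with MAXAGE at 15s.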

Since restarting nut seemed to fix the problem I decided to make upssched restart nut on a NOCOMM condition. I'll briefly describe how I did that here in case others are interested:

I set

NOTIFYFLAG NOCOMM       SYSLOG+WALL+EXEC

in upsmon.conf in the usual way with

NOTIFYCMD /sbin/upssched

set to call upssched.  In upssched.conf I set

CMDSCRIPT /sbin/upsschedcmd

and

AT NOCOMM   * EXECUTE restart

/sbin/upsschedcmd is my command script, the relevant portion of which is :

#!/bin/bash
# This script is called by upssched on a UPS event.
# This script is designed to be run by user nut.

case "$1" in
 restart)
   /sbin/upsrestart.x
   ;;
esac
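
A possible variation on the script above, offered as a sketch only and not the script actually in use here: it adds a catch-all branch so that unexpected event names show up on stderr, and makes the wrapper path overridable for dry runs. RESTART_CMD and handle_event are made-up names.

```shell
#!/bin/sh
# Sketch of a upssched command script with a catch-all branch.
# RESTART_CMD and handle_event are hypothetical names used for illustration.

# Path to the setuid wrapper; override it (e.g. RESTART_CMD=true) for a dry run.
RESTART_CMD=${RESTART_CMD:-/sbin/upsrestart.x}

handle_event() {
    case "$1" in
        restart)
            "$RESTART_CMD"
            ;;
        *)
            # Unknown event name: report it and fail.
            echo "upsschedcmd: unrecognized event: $1" >&2
            return 1
            ;;
    esac
}

# upssched passes the name given on the AT line as the first argument.
if [ "$#" -gt 0 ]; then
    handle_event "$1"
fi
```

Running it by hand with RESTART_CMD=true in the environment and "restart" as the argument exercises the restart branch without touching nut.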

upsrestart.x is the following C code, compiled using the gcc line in the comment, and chowned/chmoded to have the ownership/permissions given in the same comment :


#include <stdio.h>
#include <unistd.h>

/*

This program is designed to restart nut.
The binary file permissions should be -rwsr-xr-- root:nut

gcc -g -Wall -o upsrestart.x upsrestart.c

*/

int main (int argc, char *argv[])
{
  /* command to run, with a small fixed environment */
  char *arg[] = { "/etc/init.d/nut", "restart", (char *) NULL };
  char *env[] = { "USER=root", "PATH=/usr/sbin:/usr/bin:/sbin:/bin", "HOME=/root", (char *) NULL };

  execve (arg[0], arg, env);

  /* if execve() returns there has been an error; perror() adds the reason */
  perror ("upsrestart.c : error calling execve()");

  return 1;   /* non-zero so the caller can see the failure */
}
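
The ownership and mode named in the comment can be applied along these lines. This is a sketch under assumptions: install_wrapper and wrapper_mode are made-up helper names, and stat -c is the GNU coreutils form. The mode -rwsr-xr-- corresponds to octal 4754.

```shell
#!/bin/sh
# Sketch: give the compiled wrapper the mode from the C comment (-rwsr-xr--).
# 4754 = setuid bit (4) + rwx owner (7) + r-x group (5) + r-- other (4).
# install_wrapper and wrapper_mode are hypothetical helper names.

WRAPPER_MODE=4754

# Apply ownership (optional second argument, e.g. root:nut) and then the mode.
# chown goes first because changing ownership can clear the setuid bit.
install_wrapper() {
    file=$1
    owner=${2:-}
    if [ -n "$owner" ]; then
        chown "$owner" "$file" || return 1
    fi
    chmod "$WRAPPER_MODE" "$file"
}

# Print the symbolic mode so it can be compared with the comment.
wrapper_mode() {
    stat -c '%A' "$1"
}
```

As root, install_wrapper /sbin/upsrestart.x root:nut followed by wrapper_mode /sbin/upsrestart.x should then show -rwsr-xr--.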




What happens is that upssched runs /sbin/upsschedcmd as user nut, which runs the setuid program upsrestart.x, which in turn runs /etc/init.d/nut restart as effective user root, restarting nut and, it appears so far, reestablishing the connection with the UPS. Since this runs on NOCOMM, default timeout 300s, that becomes the max time your system can't talk to your UPS. Since I have DEADTIME set to 150s, a stale UPS that was last known to be on battery will trigger a shutdown before the NOCOMM restart takes effect.

The binary wrapper is necessary because Linux ignores setuid bits applied to scripts. Furthermore, modern versions of bash drop setuid privileges on startup, unless called with -p. The /etc/init.d/nut script uses /bin/sh. The above works on Debian because, according to the "system" man page (of all places :) : "Debian uses a modified bash which does not do this when invoked as sh". On other flavours of Linux you may need to tweak the first line of /etc/init.d/nut to prevent it dropping privileges.
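
To check whether privileges survive on a given distribution, one option is to temporarily print the real and effective uids from inside the script that the wrapper starts. This is only a sketch: show_uids is a made-up helper name, while id -u and id -ru are standard options of id.

```shell
#!/bin/sh
# Sketch: report real and effective uid of the running shell.
# show_uids is a hypothetical helper name used for illustration.

show_uids() {
    # id -ru prints the real uid, id -u the effective uid.
    printf 'real=%s effective=%s\n' "$(id -ru)" "$(id -u)"
}

show_uids
```

Started through the setuid wrapper, a shell that kept its privileges would report effective=0 alongside the real uid of nut, whereas one that dropped them would show the same non-root uid twice.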

I think the above is safe because the binary can only restart nut, nothing else, and can only be run by root or nut. I'm not exactly a security expert though, so I might be wrong.

Anyway, I set up the above 10 days ago, and this morning it triggered. I have it configured to send me an email too. It sent one email, and restarted nut successfully. Comms were reestablished. The only thing that didn't go entirely according to plan is that the old upsmon stuck around as a defunct nut process and a running root process. I don't know why they didn't die, but they were easily killed off manually later. It was definitely better to get one email and comms reestablished after 5 minutes than 70 emails and no communications all night.

best
/rob

