Hi all,

We have recently bought an APC UPS and are in the process of setting up the NUT software to make use of it. We are experiencing a problem with the behaviour of the slave systems when the master system goes off line. Although the failure of our master system will (hopefully) be a rare event, and we hope not to experience too many power outages, it is possible (if unlikely) that both circumstances will occur at the same time. I have searched the list, but not found anyone else with this problem. We would appreciate some help and advice if possible.
I will first give a very brief overview of our set up, then detail the problem, and finally provide detailed information on our set up and its configuration.

++ Brief overview of set up.

Our APC UPS is attached to a PC by a serial cable. This PC acts as the NUT master system (with NUT server and client software installed) and is connected to the network. Two other systems act as NUT slave systems (with NUT client software installed); these are also attached to the network and monitor the master system over this network connection.

This is a test rig. It has shown the NUT software and UPS to operate very successfully in many different circumstances. As stated above, the circumstances that lead to our problem should be rare.

++ Details of the problem.

Problem
_______

We have conducted some tests in which the master PC is unexpectedly shut down, both while the UPS is On Line (OL) and while it is On Battery (OB). Both tests showed that the slave systems did not register the loss of the master system for 15 minutes. This is too long, because the fully charged battery of the UPS will probably not last for 15 minutes, and there is no guarantee that such a failure will occur with a fully charged battery.

Our Understanding of the Expected NUT Behaviour
_______________________________________________

It is our understanding that the NUT software process "upsmon" is responsible for monitoring the "upsd" process on the master system, which provides information about the state of the UPS. Each slave system can set parameters for the upsmon process (using the NUT configuration file "upsmon.conf"). One of these parameters is called "DEADTIME". The man page for upsmon (upsmon.8) states:

    DEAD UPSES
        In the event that upsmon can’t reach upsd(8), it declares that
        UPS dead after some interval controlled by DEADTIME in the
        upsmon.conf(5). If this happens while that UPS was last known
        to be on battery, it is assumed to have gone critical and no
        longer contributes to the overall power value.
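To make our reading of that man page excerpt concrete, here is a minimal sketch in Python. It is not taken from the NUT source: the names `UpsState`, `assumed_critical` and `should_shut_down` are hypothetical, every UPS is given a power value of 1, and a reachable UPS is simply treated as supplying power. It models only the timing rule, and it assumes that each poll attempt returns promptly so that the check can actually run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpsState:
    """Hypothetical record of what a monitoring client last saw for one UPS."""
    last_status: str     # last status string received, e.g. "OL" or "OB"
    last_contact: float  # time (in seconds) of the last successful poll


def assumed_critical(ups: UpsState, now: float, deadtime: float = 15) -> bool:
    """Apply the rule from the man page excerpt.

    A UPS is 'dead' once it has been unreachable for longer than deadtime.
    A dead UPS that was last known to be on battery (OB) is assumed critical.
    """
    dead = (now - ups.last_contact) > deadtime
    on_battery = "OB" in ups.last_status.split()
    return dead and on_battery


def should_shut_down(upses: List[UpsState], now: float,
                     minsupplies: int = 1, deadtime: float = 15) -> bool:
    """Simplified power-value check: shut down when too few supplies remain.

    Assumption: each UPS contributes a power value of 1 unless it is
    assumed critical by the rule above.
    """
    power_value = sum(
        1 for ups in upses if not assumed_critical(ups, now, deadtime)
    )
    return power_value < minsupplies


if __name__ == "__main__":
    # UPS last seen on battery at t=100 s; checked again at t=116 s.
    ups = UpsState(last_status="OB", last_contact=100.0)
    print(should_shut_down([ups], now=116.0, minsupplies=1, deadtime=15))
```

Under this reading, a slave whose only UPS was last seen on battery should decide to shut down shortly after DEADTIME seconds without contact, which is the behaviour we expected to see.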
The parameter DEADTIME has units of seconds. It is set to "15" by default, indicating that after 15 seconds of being unable to contact the master's upsd process, the slave upsmon process should make a decision on whether to shut the system down. (The decision is based on the last known state of the UPS [OL or OB] and whether the system has an alternative power source.)

We have changed this parameter on the slave systems, but the changes have not affected the 15-minute delay between the shutdown of the master and the slaves registering the absence of the master's upsd process.

We expect that if the UPS is OB and the master system is shut down, the slaves will begin to shut down after a delay of DEADTIME seconds. It is clear that something other than the upsmon DEADTIME parameter is affecting the behaviour of the slaves, but we don't know how to alter this.

A Guess at the Root of this Problem
___________________________________

We have done a little further investigation to try to understand what is going on and what we are doing wrong. Running a slave upsmon process with a debugging flag set shows that the 15-minute delay occurs during upsmon's poll of the master's upsd process. Once the master has gone off line, the slave upsmon reports:

    polling ups: [EMAIL PROTECTED]
    get_var: [EMAIL PROTECTED] / status

and then 'hangs'. A 15-minute delay follows before the poll returns and reports that the master's upsd process is not reachable.

A brief examination of the NUT source code indicates that a system "write" call is used to communicate across the network with the upsd process of the master. We think that this system function blocks by default, and that the default blocking behaviour may be in use. We don't know; this is probably very wide of the mark, but it is the best we have come up with!

We expect that this problem is caused by our set up and configuration of the NUT software.
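To illustrate the kind of blocking we are guessing at, here is a minimal sketch in Python. It is not NUT's code and says nothing definite about how upsmon talks to upsd: the function name `poll_with_timeout`, the request bytes and the timeout value are all hypothetical. It only shows that a network poll with an explicit per-operation timeout gives up after a bounded time when the peer never answers, whereas a socket left in its default blocking mode has no such bound of its own.

```python
import socket
import time
from typing import Optional


def poll_with_timeout(host: str, port: int, request: bytes,
                      timeout_s: float) -> Optional[bytes]:
    """Send one request and wait for one reply.

    Each socket operation (connect, send, receive) is limited to
    timeout_s seconds. Returns the reply bytes, or None if the peer
    could not be reached or did not answer in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(request)
            # Raises a timeout error (a subclass of OSError) if no data
            # arrives within timeout_s seconds.
            return sock.recv(1024)
    except OSError:
        return None


if __name__ == "__main__":
    # Stand-in for a peer that has stopped responding: a local listening
    # socket that never accepts or replies. The kernel still completes
    # the connection, so the client gets as far as waiting for a reply.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    port = silent.getsockname()[1]

    start = time.monotonic()
    reply = poll_with_timeout("127.0.0.1", port, b"PING\n", timeout_s=1.0)
    elapsed = time.monotonic() - start
    print("reply:", reply, "after about", round(elapsed, 1), "seconds")
    silent.close()
```

If our guess is right, the poll on the slaves behaves like the same exchange without the timeout, so DEADTIME is never consulted until the blocked call finally returns.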
Has anyone seen similar behaviour? Does anyone have any suggestions on how to fix this problem? Any sharing of knowledge or suggestions will be appreciated.

Best wishes,
Jon Clark

++ Details about the set up

In almost all cases, the default configuration settings are in use where possible.

Master Configuration Files
__________________________

ups.conf
--------

    $ grep -v "#" ups.conf
    [apcups]
        driver = apcsmart
        port = /dev/ttyS0

upsd.conf
---------

    $ grep -v "#" upsd.conf
    ACL all 0.0.0.0/0
    ACL localhost 127.0.0.1/32
    ACL nutMaster xx.xx.xx.xx1/32
    ACL nutSlave1 xx.xx.xx.xx7/32
    ACL nutSlave2 xx.xx.xx.xx3/32
    ACCEPT localhost nutMaster nutSlave1 nutSlave2
    REJECT all

upsd.users
----------

    $ grep -v "#" upsd.users
    [upsadmin]
        password = ****
        allowfrom = nutMaster
        actions = SET
        instcmds = ALL
    [monmaster]
        password = ****
        allowfrom = nutMaster
        upsmon master
    [monslave-nutSlave1]
        password = ****
        allowfrom = nutSlave1
        upsmon slave
    [monslave-nutSlave2]
        password = ****
        allowfrom = nutSlave2
        upsmon slave

upsmon.conf
-----------

    $ grep -v "#" upsmon.conf
    MONITOR [EMAIL PROTECTED] 1 monmaster **** master
    MINSUPPLIES 1
    SHUTDOWNCMD "/sbin/shutdown -h +0"
    POLLFREQ 5
    POLLFREQALERT 5
    HOSTSYNC 15
    DEADTIME 15
    POWERDOWNFLAG /etc/killpower
    RBWARNTIME 43200
    NOCOMMWARNTIME 300
    FINALDELAY 5

Slave Configuration Files
_________________________

(Both slaves have similar settings and exhibit similar behaviour.)

upsmon.conf
-----------

    $ grep -v "#" upsmon.conf
    MONITOR [EMAIL PROTECTED] 1 monslave-nutSlave1 **** slave
    MINSUPPLIES 1
    SHUTDOWNCMD "/sbin/shutdown -h +0"
    POLLFREQALERT 5
    HOSTSYNC 15
    DEADTIME 15
    POWERDOWNFLAG /etc/killpower
    NOCOMMWARNTIME 300
    FINALDELAY 0

Computer Operating Systems
__________________________

nutMaster: Scientific Linux 4.4
nutSlave1: Scientific Linux 4.1

(Scientific Linux is a Red Hat Enterprise Linux recompile.)
NUT Software Versions
_____________________

nutMaster:
- nut-2.2.0-3.3.el4.i386.rpm
- nut-client-2.2.0-3.3.el4.i386.rpm

nutSlave1:
- nut-client-2.2.0-3.3.el4.i386.rpm

UPS Details
___________

Brand: APC
Model: Smart-UPS RT 8000VA RM 230V (XLI)

--
----------------------------
Jon Clark
Scientific Officer
Dept. of Applied Mathematics
University of Sheffield
Sheffield, S3 7RH, UK
----------------------------

_______________________________________________
Nut-upsuser mailing list
[email protected]
http://lists.alioth.debian.org/mailman/listinfo/nut-upsuser

