Hi,
I ran the strace command while upssched was 100% CPU hungry. This is what I got:
1686633611.702798 read(7, "", 1) = 0 <0.000004>
1686633611.702816 read(7, "", 1) = 0 <0.000004>
1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL)
= 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
1686633611.702862 read(7, "", 1) = 0 <0.000004>
1686633611.702880 read(7, "", 1) = 0 <0.000004>
.
.
.
1686633611.705206 read(7, "", 1) = 0 <0.000004>
1686633611.705224 read(7, "", 1) = 0 <0.000004>
1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL)
= 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
1686633611.705268 read(7, "", 1) = 0 <0.000004>
1686633611.705288 read(7, "", 1) = 0 <0.000302>
there are 128 "read(7" lines between every pselect6 lines (I did count the read
lines for couple of blocks, 128 in each of them). Nothing else in the strace
log, just read and pselect lines for the short time I let strace to run.
Any help from this?
Best regards,
Kari
________________________________
From: Jim Klimov <[email protected]>
Sent: Tuesday, June 13, 2023 00:35
To: Kari Lempiäinen <[email protected]>
Cc: Arnaud Quette via Nut-upsuser <[email protected]>
Subject: Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12
Hello,
Thank you for the report and I don't recall any similar issues.
Unfortunately, upssched did not have any "native" enablement of debug
verbosity in 2.8.0 and older releases (some added with
https://github.com/networkupstools/nut/pull/1864 recently), so maybe the best
way to get more info at the moment would be to watch syscalls made when it
begins to hog CPU and make educated guesses from there, e.g.:
# strace -f -ttt -T -p "`ps -ef | grep -w upssched | grep -v grep | awk
'{print $2}'`"
I suppose when hogging the CPU it has some reason to quickly interrupt its
processing loop - perhaps a lack of some file (socket?) it waits on, etc.
Hope this helps,
Jim Klimov
On Mon, Jun 12, 2023, 14:59 Kari Lempiäinen
<[email protected]<mailto:[email protected]>> wrote:
Hi,
I have been running nut successfully for a long time with my Debian 11 server.
I upgraded my server to Debian 12 today, which upgraded nut also from 2.7.4-13
to 2.8.0-7. I noticed that after upgrade there was a upssched process running
and taking 100% cpu time.
I checked if there were any changes to configuration file formats with nut
upgrade and only differences I noticed were a terminology change from
master/slave to primary/secondary in upsd.users and upsmon.conf, and dummy mode
change for the dummy-ups. I didn’t expect these being the culprit, but I made
the changes to my configuration. No change, after five minutes of starting nut,
upssched will be running with 100% cpu usage.
Any ideas what could be causing this? Here is my configuration and in the end
are some log snippets:
---------------------------------------------------
heartbeat.conf
# heartbeat.conf -- 10 minute heartbeat
ups.status: OL
TIMER 300
ups.status: OB
TIMER 300
---------------------------------------------------
hosts.conf
MONITOR eaton5s@localhost "Eaton 5S 700"
---------------------------------------------------
nut.conf
MODE=netserver
---------------------------------------------------
ups.conf
maxretry = 3
[eaton5s]
driver = usbhid-ups
port = auto
desc = "Eaton 5S 700i UPS"
vendorid = "0463"
productid = "FFFF"
pollinterval = 15
[heartbeat]
driver = dummy-ups
mode=dummy-loop
port = heartbeat.conf
desc = "Watch over NUT"
---------------------------------------------------
upsd.conf
LISTEN 0.0.0.0 3493
---------------------------------------------------
upsmon.conf
RUN_AS_USER root
MONITOR eaton5s@localhost 1 upsmaster xxx primary
MONITOR heartbeat@localhost 0 upsmaster xxx primary
MINSUPPLIES 1
NOTIFYCMD /sbin/upssched
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
RBWARNTIME 43200
NOCOMMWARNTIME 300
FINALDELAY 5
NOTIFYMSG ONLINE "UPS %s: On line power."
NOTIFYMSG ONBATT "UPS %s: On battery."
NOTIFYMSG LOWBATT "UPS %s: Battery is low."
NOTIFYMSG REPLBATT "UPS %s: Battery needs to be replaced."
NOTIFYMSG FSD "UPS %s: Forced shutdown in progress."
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding."
NOTIFYMSG COMMOK "UPS %s: Communications (re-)established."
NOTIFYMSG COMMBAD "UPS %s: Communications lost."
NOTIFYMSG NOCOMM "UPS %s: Not available."
NOTIFYMSG NOPARENT "upsmon parent dead, shutdown impossible."
NOTIFYFLAG ONLINE EXEC
NOTIFYFLAG ONBATT EXEC
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG+WALL
---------------------------------------------------
upssched.conf
CMDSCRIPT /etc/nut/upssched-cmd
PIPEFN /var/run/nut/upssched.pipe
LOCKFN /var/run/nut/upssched.lock
AT ONBATT eaton5s@localhost EXECUTE onbatt
AT ONBATT eaton5s@localhost START-TIMER beeperoff 15
AT ONLINE eaton5s@localhost EXECUTE online
AT ONLINE eaton5s@localhost CANCEL-TIMER beeperoff
AT ONBATT eaton5s@localhost START-TIMER shutdown-ws2019 300
AT ONLINE eaton5s@localhost CANCEL-TIMER shutdown-ws2019
AT ONBATT heartbeat@localhost CANCEL-TIMER heartbeat-failure-timer
AT ONBATT heartbeat@localhost START-TIMER heartbeat-failure-timer 660
---------------------------------------------------
upsd.users
[upsmaster]
password = xxx
upsmon primary
actions = set
actions = fsd
instcmds = all
[upsslave]
password = xxx
upsmon secondary
---------------------------------------------------
upssched-cmd
#! /bin/bash
UPS="eaton5s"
UPS_USERNAME="upsmaster"
UPS_PASSWORD="xxx"
[email protected]<mailto:[email protected]>
[email protected]<mailto:[email protected]>
STATUS=`upsc $UPS ups.status 2>&1 | grep -v '^Init SSL'`
CHARGE=`upsc $UPS battery.charge 2>&1 | grep -v '^Init SSL'`
CHMSG="[$STATUS]:$CHARGE%"
MSG="dummy message - $1"
case $1 in
online) MSG="$UPS, $CHMSG - On line power - power supply had been
restored." ;;
onbatt) MSG="$UPS, $CHMSG - On battery - power failure - save your
work!" ;;
lowbatt) MSG="$UPS, $CHMSG - Battery is low - shutdown now!" ;;
beeperoff)
logger -i -t upsshed-cmd "Turning off Beeper"
STATUS=`upsc ${UPS} ups.beeper.status 2>&1 | grep -v '^Init
SSL'`
if [[ "${STATUS}" == "enabled" ]] ; then
upscmd -u ${UPS_USERNAME} -p ${UPS_PASSWORD} ${UPS}
beeper.toggle
fi
;;
shutdown-ws2019)
MSG="$UPS, $CHMSG - Stopping WS2019"
logger -i -t upsshed-cmd "Stopping WS2019"
setsid /usr/local/bin/shutdown-ws2019 &
;;
upsgone)
logger -i -t upssched-cmd "The UPS has been gone for awhile"
;;
heartbeat-failure-timer)
MSG="NUT heart beat fails. $CHMSG"
# Email
MSG1="Hello, upssched-cmd reports NUT heartbeat has failed."
MSG2="Current status: $CHMSG \n\n$0 $1"
MSG3="\n\n$( ps -elf | grep -E 'ups[dms]|nut' )"
echo -e "$MSG1 $MSG2 $MSG3" | /usr/bin/mail -r "$EMAIL_FROM" \
-s "NUT heart beat fails. Currently $CHMSG" "$EMAIL_TO" ;;
*)
logger -i -t upssched-cmd "Unrecognized command: $1"
;;
esac
logger -i -t upssched-cmd $MSG
wall "$MSG"
---------------------------------------------------
---------------------------------------------------
This is normal ups activity in daemon log from the previous version:
Jun 10 23:10:50 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer
Jun 10 23:10:50 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)
Jun 10 23:20:50 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer
Jun 10 23:20:50 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)
Jun 10 23:30:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer
Jun 10 23:30:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)
Jun 10 23:40:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer
Jun 10 23:40:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)
Jun 10 23:50:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer
Jun 10 23:50:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)
---------------------------------------------------
This is daemon log from the new version, first two lines are the last lines of
the startup messages. One same upssched process (124835) running with 100% cpu
from 21:07:
Jun 11 21:02:40 fricka dummy-ups[121803]: sock_connect: enabling asynchronous
mode (auto)
Jun 11 21:02:41 fricka usbhid-ups[121820]: sock_connect: enabling asynchronous
mode (auto)
Jun 11 21:07:40 fricka upssched[124835]: Timer daemon started
Jun 11 21:07:40 fricka upssched[124835]: New timer: heartbeat-failure-timer
(660 seconds)
Jun 11 21:07:40 fricka nut-monitor[124832]: Network UPS Tools upsmon 2.8.0
Jun 11 21:12:40 fricka nut-monitor[128005]: Network UPS Tools upsmon 2.8.0
Jun 11 21:17:40 fricka upssched[124835]: Cancelling timer:
heartbeat-failure-timer
Jun 11 21:17:40 fricka upssched[124835]: New timer: heartbeat-failure-timer
(660 seconds)
Jun 11 21:17:40 fricka nut-monitor[131233]: Network UPS Tools upsmon 2.8.0
Jun 11 21:22:45 fricka nut-monitor[134330]: Network UPS Tools upsmon 2.8.0
Jun 11 21:27:45 fricka upssched[124835]: Cancelling timer:
heartbeat-failure-timer
Jun 11 21:27:45 fricka upssched[124835]: New timer: heartbeat-failure-timer
(660 seconds)
Jun 11 21:27:45 fricka nut-monitor[137838]: Network UPS Tools upsmon 2.8.0
---------------------------------------------------
_______________________________________________
Nut-upsuser mailing list
[email protected]<mailto:[email protected]>
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/nut-upsuser
_______________________________________________
Nut-upsuser mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/nut-upsuser