I had some more time to look into this. I've upgraded several
machines to more recent snapshots (which includes rzalamena's commit
to util.c and reyk's commit to log.c) but still see this. It happens
on 'real' amd64 machines, KVM vm's running amd64 and an i386 machine
too.
I've now ktraced with -i next to -d (-i was missing previously, so the
traces weren't very useful - apologies). The ktrace and kdump output
are rather large (60.2M for the ktrace, 58.5M for the kdump) for the
i386 machine that was first to hit it again after running under
ktrace, so I'm not attaching those (happy to make those available on
request).
I was looking for "exit" in the kdump output and I see regularly
occuring exits (from what looks like the constraint process).
However, at some point there's this (kdump | cat -n):
1125061 87570 ntpd CALL socket(AF_INET,0x1<SOCK_STREAM>,0x6)
1125062 87570 ntpd RET socket 4
1125063 87570 ntpd CALL kbind(0xcf7f0248,12,0x5fdcc0e48ad4ad53)
1125064 87570 ntpd RET kbind 0
1125065 87570 ntpd CALL connect(4,0x809207e0,16)
1125066 87570 ntpd STRU struct sockaddr { AF_INET, 199.16.156.102:443 }
1125067 16298 ntpd PSIG SIGTERM SIG_DFL
1125068 83655 ntpd STRU struct pollfd [4] { fd=3, events=0x1<POLLIN>,
revents=0<> } { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=5,
events=0x1<POLLIN>, revents=0<> } { fd=17, events=0x1<POLLIN>, revents=0<> }
1125069 83655 ntpd PSIG SIGTERM caught handler=0x165af6e0 mask=0<>
1125070 83655 ntpd RET poll -1 errno 4 Interrupted system call
1125071 83655 ntpd CALL sigreturn(0xcf7cdf6c)
1125072 83655 ntpd RET sigreturn JUSTRETURN
1125073 83655 ntpd CALL sendmsg(3,0xcf7ce23c,0)
1125074 83655 ntpd STRU struct msghdr { name=0x0, namelen=0,
iov=0xcf7cc23c, iovlen=0, control=0x0, controllen=0, flags=0 }
1125075 83655 ntpd RET sendmsg -1 errno 32 Broken pipe
1125076 83655 ntpd CALL kbind(0xcf7cc1a8,12,0x2af7a8a1a984823f)
1125077 83655 ntpd RET kbind 0
1125078 83655 ntpd CALL sendmsg(4,0xcf7ce23c,0)
1125079 83655 ntpd STRU struct msghdr { name=0x0, namelen=0,
iov=0xcf7cc23c, iovlen=0, control=0x0, controllen=0, flags=0 }
1125080 83655 ntpd RET sendmsg 0
1125081 83655 ntpd CALL write(2,0xcf7cdd5e,0x13)
1125082 83655 ntpd GIO fd 2 wrote 19 bytes
1125083 "ntp engine exiting
1125084 "
and a bit later:
1127174 87570 ntpd CALL munmap(0x7f568000,0x1000)
1127175 87570 ntpd RET munmap 0
1127176 87570 ntpd CALL exit(0)
1127177 38871 ntpd RET wait4 87570/0x15612
1127178 38871 ntpd CALL wait4(WAIT_ANY,0,0<>,0)
1127179 38871 ntpd RET wait4 -1 errno 10 No child processes
1127180 38871 ntpd CALL getpid()
1127181 38871 ntpd RET getpid 38871/0x97d7
1127182 38871 ntpd CALL sendsyslog(0xcf7dbaf7,28,0<>)
1127183 38871 ntpd GIO fd -1 wrote 28 bytes
1127184 "<30>ntpd[38871]: Terminating"
1127185 38871 ntpd RET sendsyslog 0
1127186 38871 ntpd CALL kbind(0xcf7de048,12,0xc1820ba2e9e4674a)
1127187 38871 ntpd RET kbind 0
and then PID 38871 also exits, and then that's that.
So, there's this pid 16298 in there that gets SIGTERM for some reason
(line 1125067). This is also a constraint process:
1090139 16298 ntpd NAMI "/usr/sbin/ntpd"
1090140 16298 ntpd ARGS
1090141 [0] = "ntpd"
1090142 [1] = "-P"
1090143 [2] = "constraint"
1090144 [3] = "-v"
1090145 16298 ntpd NAMI "/usr/libexec/ld.so"
Which, apart from the SIGTERM seems to behave quite like the other
constraint processes (from casual reading of the kdump output, at
least).
Any ideas what I should look for now?
Paul 'WEiRD' de Weerd
On Sun, Oct 02, 2016 at 06:37:04PM +0200, Sebastien Marie wrote:
| Hi Paul,
|
| Just to saying that I saw this behaviour. More details below.
|
| On Sun, Oct 02, 2016 at 06:15:42PM +0200, Paul de Weerd wrote:
| >
| > Since upgrading several machines to a more current snapshot (in an
| > effort to keep up to date with LibreSSL fixes), I noticed ntpd started
| > failing. Here's what I see on my laptop:
| >
| > Oct 2 17:50:15 drop doas: weerd ran command rcctl start ntpd as root from
/home/weerd
| > Oct 2 17:50:28 drop ntpd[17059]: Terminating
| > Oct 2 17:52:50 drop doas: weerd ran command rcctl start ntpd as root from
/home/weerd
| > Oct 2 17:53:03 drop ntpd[27986]: Terminating
| > Oct 2 17:53:18 drop doas: weerd ran command ktrace -d ntpd -v as root from
/home/weerd
| > Oct 2 17:53:30 drop ntpd[97855]: Terminating
| >
| > I have ntpd_flags="-v" in my /etc/rc.conf.local, but the above is all
| > the logging output I get (that's from my *.* syslog file, that also
| > includes DEBUG output).
|
| for me, no override of ntpd_flags.
|
| ntpd.conf configured as:
| servers fr.pool.ntp.org
| constraints from www.duckduckgo.com
|
| > So far, 5 machines have been upgraded and 4 of them show this behavior
| > (although my laptop is now really fast to exit). These are all amd64
| > machines (because I haven't gotten around to upgrading the few i386 I
| > still have running yet). They're 'real' and virtual (KVM) machines.
|
| I saw it on two amd64 hosts. I have i386 hosts (but the config is
| differents: use local servers).
|
| > Has anyone else seen this behavior?
|
| yes.
|
| I also tried to run with kern.nosuidcoredump=3 but the terminating don't
| generated any coredump in /var/crash/ntpd.
|
| the process terminaison isn't automatic: my ntpd is now running fine
| since almost 1 day.
| --
| Sebastien Marie
|
--
>++++++++[<++++++++++>-]<+++++++.>+++[<------>-]<.>+++[<+
+++++++++++>-]<.>++[<------------>-]<+.--------------.[-]
http://www.weirdnet.nl/