Re: ntpd failing during runtime without indication why

Paul de Weerd Tue, 18 Oct 2016 13:42:32 -0700

I had some more time to look into this.  I've upgraded several
machines to more recent snapshots (which includes rzalamena's commit
to util.c and reyk's commit to log.c) but still see this.  It happens
on 'real' amd64 machines, KVM vm's running amd64 and an i386 machine
too.


I've now ktraced with -i next to -d (-i was missing previously, so the
traces weren't very useful - apologies).  The ktrace and kdump output
are rather large (60.2M for the ktrace, 58.5M for the kdump) for the
i386 machine that was first to hit it again after running under
ktrace, so I'm not attaching those (happy to make those available on
request).

I was looking for "exit" in the kdump output and I see regularly
occuring exits (from what looks like the constraint process).
However, at some point there's this (kdump | cat -n):

1125061  87570 ntpd     CALL  socket(AF_INET,0x1<SOCK_STREAM>,0x6)
1125062  87570 ntpd     RET   socket 4
1125063  87570 ntpd     CALL  kbind(0xcf7f0248,12,0x5fdcc0e48ad4ad53)
1125064  87570 ntpd     RET   kbind 0
1125065  87570 ntpd     CALL  connect(4,0x809207e0,16)
1125066  87570 ntpd     STRU  struct sockaddr { AF_INET, 199.16.156.102:443 }
1125067  16298 ntpd     PSIG  SIGTERM SIG_DFL
1125068  83655 ntpd     STRU  struct pollfd [4] { fd=3, events=0x1<POLLIN>, 
revents=0<> } { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=5, 
events=0x1<POLLIN>, revents=0<> } { fd=17, events=0x1<POLLIN>, revents=0<> }
1125069  83655 ntpd     PSIG  SIGTERM caught handler=0x165af6e0 mask=0<>
1125070  83655 ntpd     RET   poll -1 errno 4 Interrupted system call
1125071  83655 ntpd     CALL  sigreturn(0xcf7cdf6c)
1125072  83655 ntpd     RET   sigreturn JUSTRETURN
1125073  83655 ntpd     CALL  sendmsg(3,0xcf7ce23c,0)
1125074  83655 ntpd     STRU  struct msghdr { name=0x0, namelen=0, 
iov=0xcf7cc23c, iovlen=0, control=0x0, controllen=0, flags=0 }
1125075  83655 ntpd     RET   sendmsg -1 errno 32 Broken pipe
1125076  83655 ntpd     CALL  kbind(0xcf7cc1a8,12,0x2af7a8a1a984823f)
1125077  83655 ntpd     RET   kbind 0
1125078  83655 ntpd     CALL  sendmsg(4,0xcf7ce23c,0)
1125079  83655 ntpd     STRU  struct msghdr { name=0x0, namelen=0, 
iov=0xcf7cc23c, iovlen=0, control=0x0, controllen=0, flags=0 }
1125080  83655 ntpd     RET   sendmsg 0
1125081  83655 ntpd     CALL  write(2,0xcf7cdd5e,0x13)
1125082  83655 ntpd     GIO   fd 2 wrote 19 bytes
1125083        "ntp engine exiting
1125084        "

and a bit later:

1127174  87570 ntpd     CALL  munmap(0x7f568000,0x1000)
1127175  87570 ntpd     RET   munmap 0
1127176  87570 ntpd     CALL  exit(0)
1127177  38871 ntpd     RET   wait4 87570/0x15612
1127178  38871 ntpd     CALL  wait4(WAIT_ANY,0,0<>,0)
1127179  38871 ntpd     RET   wait4 -1 errno 10 No child processes
1127180  38871 ntpd     CALL  getpid()
1127181  38871 ntpd     RET   getpid 38871/0x97d7
1127182  38871 ntpd     CALL  sendsyslog(0xcf7dbaf7,28,0<>)
1127183  38871 ntpd     GIO   fd -1 wrote 28 bytes
1127184        "<30>ntpd[38871]: Terminating"
1127185  38871 ntpd     RET   sendsyslog 0
1127186  38871 ntpd     CALL  kbind(0xcf7de048,12,0xc1820ba2e9e4674a)
1127187  38871 ntpd     RET   kbind 0

and then PID 38871 also exits, and then that's that.


So, there's this pid 16298 in there that gets SIGTERM for some reason
(line 1125067).  This is also a constraint process:

1090139  16298 ntpd     NAMI  "/usr/sbin/ntpd"
1090140  16298 ntpd     ARGS
1090141         [0] = "ntpd"
1090142         [1] = "-P"
1090143         [2] = "constraint"
1090144         [3] = "-v"
1090145  16298 ntpd     NAMI  "/usr/libexec/ld.so"

Which, apart from the SIGTERM seems to behave quite like the other
constraint processes (from casual reading of the kdump output, at
least).

Any ideas what I should look for now?

Paul 'WEiRD' de Weerd

On Sun, Oct 02, 2016 at 06:37:04PM +0200, Sebastien Marie wrote:
| Hi Paul,
| 
| Just to saying that I saw this behaviour. More details below.
| 
| On Sun, Oct 02, 2016 at 06:15:42PM +0200, Paul de Weerd wrote:
| > 
| > Since upgrading several machines to a more current snapshot (in an
| > effort to keep up to date with LibreSSL fixes), I noticed ntpd started
| > failing.  Here's what I see on my laptop:
| > 
| > Oct  2 17:50:15 drop doas: weerd ran command rcctl start ntpd as root from 
/home/weerd
| > Oct  2 17:50:28 drop ntpd[17059]: Terminating
| > Oct  2 17:52:50 drop doas: weerd ran command rcctl start ntpd as root from 
/home/weerd
| > Oct  2 17:53:03 drop ntpd[27986]: Terminating
| > Oct  2 17:53:18 drop doas: weerd ran command ktrace -d ntpd -v as root from 
/home/weerd
| > Oct  2 17:53:30 drop ntpd[97855]: Terminating
| > 
| > I have ntpd_flags="-v" in my /etc/rc.conf.local, but the above is all
| > the logging output I get (that's from my *.* syslog file, that also
| > includes DEBUG output).
| 
| for me, no override of ntpd_flags.
| 
| ntpd.conf configured as:
|       servers fr.pool.ntp.org
|       constraints from www.duckduckgo.com
| 
| > So far, 5 machines have been upgraded and 4 of them show this behavior
| > (although my laptop is now really fast to exit).  These are all amd64
| > machines (because I haven't gotten around to upgrading the few i386 I
| > still have running yet).  They're 'real' and virtual (KVM) machines.
| 
| I saw it on two amd64 hosts. I have i386 hosts (but the config is
| differents: use local servers).
| 
| > Has anyone else seen this behavior?
| 
| yes.
| 
| I also tried to run with kern.nosuidcoredump=3 but the terminating don't
| generated any coredump in /var/crash/ntpd.
| 
| the process terminaison isn't automatic: my ntpd is now running fine
| since almost 1 day.
| -- 
| Sebastien Marie
| 

-- 
>++++++++[<++++++++++>-]<+++++++.>+++[<------>-]<.>+++[<+
+++++++++++>-]<.>++[<------------>-]<+.--------------.[-]
                 http://www.weirdnet.nl/

Re: ntpd failing during runtime without indication why

Reply via email to