On Tue, Oct 18, 2016 at 10:35:44PM +0200, Paul de Weerd wrote:
> I had some more time to look into this.  I've upgraded several
> machines to more recent snapshots (which includes rzalamena's commit
> to util.c and reyk's commit to log.c) but still see this.  It happens
> on 'real' amd64 machines, KVM vm's running amd64 and an i386 machine
> too.
> I've now ktraced with -i next to -d (-i was missing previously, so the
> traces weren't very useful - apologies).  The ktrace and kdump output
> are rather large (60.2M for the ktrace, 58.5M for the kdump) for the
> i386 machine that was first to hit it again after running under
> ktrace, so I'm not attaching those (happy to make those available on
> request).

Thank you very much Paul, but your ktrace snippets actually shows one
mistake of mine.

> I was looking for "exit" in the kdump output and I see regularly
> occuring exits (from what looks like the constraint process).
> However, at some point there's this (kdump | cat -n):

The ntp engine is receiving a SIGTERM probably from the parent process,
so the other process are exiting because the socket between processes closed.

After reviewing the code to find out how it is possible that the
"ntp engine" is receiving a SIGTERM, I noticed that the start_child()
in constraint is not saving the pid value. When you hit a bad network
connection and your constraint process takes too much time, the ntp
engine is asking the parent process to kill it and since I didn't set
the constraint process pid, parent will try to kill(0, SIGTERM) which
means kill all process in group.

While I write diffs for tech@, here is a diff you can use to see if it
solves your problem:

Index: constraint.c
RCS file: /home/obsdcvs/src/usr.sbin/ntpd/constraint.c,v
retrieving revision 1.32
diff -u -p -r1.32 constraint.c
--- constraint.c        26 Sep 2016 17:17:01 -0000      1.32
+++ constraint.c        18 Oct 2016 23:03:16 -0000
@@ -217,6 +217,7 @@ priv_constraint_msg(u_int32_t id, u_int8
        struct ntp_addr         *h;
        struct constraint       *cstr;
        int                      pipes[2];
+       int                      rv;
        if ((cstr = constraint_byid(id)) != NULL) {
                log_warnx("IMSG_CONSTRAINT_QUERY repeated for id %d", id);
@@ -256,15 +257,18 @@ priv_constraint_msg(u_int32_t id, u_int8
        if (imsg_compose(&cstr->ibuf, IMSG_CONSTRAINT_QUERY, id, 0, -1,
            data, len) == -1)
                fatal("%s: imsg_compose", __func__);
-       if (imsg_flush(&cstr->ibuf) == -1)
-               fatal("%s: imsg_flush", __func__);
+       do {
+               rv = imsg_flush(&cstr->ibuf);
+       } while (rv == -1 && errno == EAGAIN);
+       if (rv == -1)
+               fatal("imsg_flush");
         * Fork child handlers and make sure to do any sensitive work in the
         * the (unprivileged) child.  The parent should not do any parsing,
         * certificate loading etc.
-       start_child(CONSTRAINT_PROC_NAME, pipes[1], argc, argv);
+       cstr->pid = start_child(CONSTRAINT_PROC_NAME, pipes[1], argc, argv);
@@ -325,7 +329,7 @@ priv_constraint_child(const char *pw_dir
        struct sigaction         sa;
        void                    *ctx;
        struct iovec             iov[2];
-       int                      i;
+       int                      i, rv;
@@ -420,7 +424,9 @@ priv_constraint_child(const char *pw_dir
        iov[1].iov_len = sizeof(xmttv);
            IMSG_CONSTRAINT_RESULT, 0, 0, -1, iov, 2);
-       imsg_flush(&cstr->ibuf);
+       do {
+               rv = imsg_flush(&cstr->ibuf);
+       } while (rv == -1 && errno == EAGAIN);
        /* Tear down the TLS connection after sending the result */

Reply via email to