Re: fix very small ntpd leak

2022-03-24 Thread Otto Moerbeek
On Thu, Mar 24, 2022 at 10:34:52AM +1000, Jonathan Matthew wrote:

> On Wed, Mar 23, 2022 at 04:59:06PM +0100, Otto Moerbeek wrote:
> > On Wed, Mar 23, 2022 at 09:09:01PM +1000, Jonathan Matthew wrote:
> > 
> > > We noticed that the ntpd engine process was getting a bit big on some boxes
> > > that we'd accidentally cut off from the ntp servers (routing is hard).
> > > Reading through the code, I noticed the 'query' member of struct ntp_peer
> > > is never freed, which seems to account for the leak.
> > > 
> > > If you have a server pool in ntpd.conf and it resolves, but ntpd is unable
> > > to talk to the servers, it will re-resolve periodically, freeing the old
> > > list of peers and creating new ones.
> > > 
> > > To show how slow the leak is, here's the leak report from MALLOC_OPTIONS=D
> > > after running for about two hours with four servers from two pools.
> > > 
> > > without diff:
> > >  
> > > Leak report
> > >                  f     sum      #    avg
> > >                0x0    9392    128     73
> > >      0x889878b920b     512      1    512
> > >      0x889878bc8e1    4096      4   1024
> > >      0x889878bd065     128      2     64
> > >      0x88bc91f0b4b   18280      1  18280
> > >      0x88bc926a9ed   65536      1  65536
> > > 
> > > with diff:
> > > 
> > > Leak report
> > >                  f     sum      #    avg
> > >                0x0    6064     16    379
> > >      0xbee1253320b     512      1    512
> > >      0xbf0265f4b4b   18280      1  18280
> > >      0xbf02666e9ed   65536      1  65536
> > > 
> > > ok?
> > > 
> > > Index: ntp.c
> > > ===================================================================
> > > RCS file: /cvs/src/usr.sbin/ntpd/ntp.c,v
> > > retrieving revision 1.168
> > > diff -u -p -r1.168 ntp.c
> > > --- ntp.c	24 Oct 2021 21:24:19 -0000	1.168
> > > +++ ntp.c	23 Mar 2022 10:43:59 -0000
> > > @@ -686,6 +686,7 @@ void
> > >  peer_remove(struct ntp_peer *p)
> > >  {
> > >  	TAILQ_REMOVE(&conf->ntp_peers, p, entry);
> > > + free(p->query);
> > >   free(p);
> > >   peer_cnt--;
> > >  }
> > > 
> > 
> > This is a bug that dlg reported last week. Serendipity or not? :-)
> 
> We found it together looking at systems we run at work, so not really.

heh, dlg was talking about a colleague but I did not connect that to you.

anyway, thanks for the analysis,

-Otto

> 
> > 
> > This is my diff that uses an approach I like a little bit better.
> 
> I agree.  I wasn't sure if there was a reason the query was allocated
> separately, so I went with the more straightforward diff to start with.
> 
> > 
> > -Otto
> > 
> > Index: client.c
> > ===================================================================
> > RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
> > retrieving revision 1.116
> > diff -u -p -r1.116 client.c
> > --- client.c	21 Apr 2021 09:38:11 -0000	1.116
> > +++ client.c	21 Mar 2022 07:31:54 -0000
> > @@ -51,10 +51,9 @@ set_deadline(struct ntp_peer *p, time_t 
> >  int
> >  client_peer_init(struct ntp_peer *p)
> >  {
> > -   if ((p->query = calloc(1, sizeof(struct ntp_query))) == NULL)
> > -   fatal("client_peer_init calloc");
> > -   p->query->fd = -1;
> > -   p->query->msg.status = MODE_CLIENT | (NTP_VERSION << 3);
> > +   p->query.fd = -1;
> > +   p->query.msg.status = MODE_CLIENT | (NTP_VERSION << 3);
> > +   p->query.xmttime = 0;
> > p->state = STATE_NONE;
> > p->shift = 0;
> > p->trustlevel = TRUSTLEVEL_PATHETIC;
> > @@ -91,7 +90,7 @@ client_addr_init(struct ntp_peer *p)
> > }
> > }
> >  
> > -   p->query->fd = -1;
> > +   p->query.fd = -1;
> > set_next(p, 0);
> >  
> > return (0);
> > @@ -100,9 +99,9 @@ client_addr_init(struct ntp_peer *p)
> >  int
> >  client_nextaddr(struct ntp_peer *p)
> >  {
> > -   if (p->query->fd != -1) {
> > -   close(p->query->fd);
> > -   p->query->fd = -1;
> > +   if (p->query.fd != -1) {
> > +   close(p->query.fd);
> > +   p->query.fd = -1;
> > }
> >  
> > if (p->state == STATE_DNS_INPROGRESS)
> > @@ -148,26 +147,26 @@ client_query(struct ntp_peer *p)
> > if (p->state < STATE_DNS_DONE || p->addr == NULL)
> > return (-1);
> >  
> > -   if (p->query->fd == -1) {
> > +   if (p->query.fd == -1) {
> > struct sockaddr *sa = (struct sockaddr *)>addr->ss;
> > struct sockaddr *qa4 = (struct sockaddr *)>query_addr4;
> > struct sockaddr *qa6 = (struct sockaddr *)>query_addr6;
> >  
> > -   if ((p->query->fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
> > +   if ((p->query.fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
> > 0)) == -1)
> > fatal("client_query socket");
> >  
> > if (p->addr->ss.ss_family == qa4->sa_family) {
> > -   if (bind(p->query->fd, qa4, SA_LEN(qa4)) == -1)
> > +   if (bind(p->query.fd, qa4, SA_LEN(qa4)) == -1)
> > 

Re: fix very small ntpd leak

2022-03-23 Thread Jonathan Matthew
On Wed, Mar 23, 2022 at 04:59:06PM +0100, Otto Moerbeek wrote:
> On Wed, Mar 23, 2022 at 09:09:01PM +1000, Jonathan Matthew wrote:
> 
> > We noticed that the ntpd engine process was getting a bit big on some boxes
> > that we'd accidentally cut off from the ntp servers (routing is hard).
> > Reading through the code, I noticed the 'query' member of struct ntp_peer
> > is never freed, which seems to account for the leak.
> > 
> > If you have a server pool in ntpd.conf and it resolves, but ntpd is unable
> > to talk to the servers, it will re-resolve periodically, freeing the old
> > list of peers and creating new ones.
> > 
> > To show how slow the leak is, here's the leak report from MALLOC_OPTIONS=D
> > after running for about two hours with four servers from two pools.
> > 
> > without diff:
> >  
> > Leak report
> >                  f     sum      #    avg
> >                0x0    9392    128     73
> >      0x889878b920b     512      1    512
> >      0x889878bc8e1    4096      4   1024
> >      0x889878bd065     128      2     64
> >      0x88bc91f0b4b   18280      1  18280
> >      0x88bc926a9ed   65536      1  65536
> > 
> > with diff:
> > 
> > Leak report
> >                  f     sum      #    avg
> >                0x0    6064     16    379
> >      0xbee1253320b     512      1    512
> >      0xbf0265f4b4b   18280      1  18280
> >      0xbf02666e9ed   65536      1  65536
> > 
> > ok?
> > 
> > Index: ntp.c
> > ===================================================================
> > RCS file: /cvs/src/usr.sbin/ntpd/ntp.c,v
> > retrieving revision 1.168
> > diff -u -p -r1.168 ntp.c
> > --- ntp.c	24 Oct 2021 21:24:19 -0000	1.168
> > +++ ntp.c	23 Mar 2022 10:43:59 -0000
> > @@ -686,6 +686,7 @@ void
> >  peer_remove(struct ntp_peer *p)
> >  {
> > 	TAILQ_REMOVE(&conf->ntp_peers, p, entry);
> > +   free(p->query);
> > free(p);
> > peer_cnt--;
> >  }
> > 
> 
> This is a bug that dlg reported last week. Serendipity or not? :-)

We found it together looking at systems we run at work, so not really.

> 
> This is my diff that uses an approach I like a little bit better.

I agree.  I wasn't sure if there was a reason the query was allocated
separately, so I went with the more straightforward diff to start with.

> 
>   -Otto
> 
> Index: client.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
> retrieving revision 1.116
> diff -u -p -r1.116 client.c
> --- client.c	21 Apr 2021 09:38:11 -0000	1.116
> +++ client.c	21 Mar 2022 07:31:54 -0000
> @@ -51,10 +51,9 @@ set_deadline(struct ntp_peer *p, time_t 
>  int
>  client_peer_init(struct ntp_peer *p)
>  {
> - if ((p->query = calloc(1, sizeof(struct ntp_query))) == NULL)
> - fatal("client_peer_init calloc");
> - p->query->fd = -1;
> - p->query->msg.status = MODE_CLIENT | (NTP_VERSION << 3);
> + p->query.fd = -1;
> + p->query.msg.status = MODE_CLIENT | (NTP_VERSION << 3);
> + p->query.xmttime = 0;
>   p->state = STATE_NONE;
>   p->shift = 0;
>   p->trustlevel = TRUSTLEVEL_PATHETIC;
> @@ -91,7 +90,7 @@ client_addr_init(struct ntp_peer *p)
>   }
>   }
>  
> - p->query->fd = -1;
> + p->query.fd = -1;
>   set_next(p, 0);
>  
>   return (0);
> @@ -100,9 +99,9 @@ client_addr_init(struct ntp_peer *p)
>  int
>  client_nextaddr(struct ntp_peer *p)
>  {
> - if (p->query->fd != -1) {
> - close(p->query->fd);
> - p->query->fd = -1;
> + if (p->query.fd != -1) {
> + close(p->query.fd);
> + p->query.fd = -1;
>   }
>  
>   if (p->state == STATE_DNS_INPROGRESS)
> @@ -148,26 +147,26 @@ client_query(struct ntp_peer *p)
>   if (p->state < STATE_DNS_DONE || p->addr == NULL)
>   return (-1);
>  
> - if (p->query->fd == -1) {
> + if (p->query.fd == -1) {
> 		struct sockaddr *sa = (struct sockaddr *)&p->addr->ss;
> 		struct sockaddr *qa4 = (struct sockaddr *)&p->query_addr4;
> 		struct sockaddr *qa6 = (struct sockaddr *)&p->query_addr6;
>  
> - if ((p->query->fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
> + if ((p->query.fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
>   0)) == -1)
>   fatal("client_query socket");
>  
>   if (p->addr->ss.ss_family == qa4->sa_family) {
> - if (bind(p->query->fd, qa4, SA_LEN(qa4)) == -1)
> + if (bind(p->query.fd, qa4, SA_LEN(qa4)) == -1)
>   fatal("couldn't bind to IPv4 query address: %s",
>   log_sockaddr(qa4));
>   } else if (p->addr->ss.ss_family == qa6->sa_family) {
> - if (bind(p->query->fd, qa6, SA_LEN(qa6)) == -1)
> + if (bind(p->query.fd, qa6, SA_LEN(qa6)) == -1)
> 			fatal("couldn't bind to IPv6 query address: %s",

Re: fix very small ntpd leak

2022-03-23 Thread Todd C. Miller
On Wed, 23 Mar 2022 16:59:06 +0100, Otto Moerbeek wrote:

> This is a bug that dlg reported last week. Serendipity or not? :-)
>
> This is my diff that uses an approach I like a little bit better.

Since client_peer_init() is always called after new_peer(), there's
no reason I can see for query to be a pointer.  Your diff seems
better to me too.

 - todd
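
For readers following along: the posted diff only covers the users of
p->query in client.c.  Making query a plain member rather than a
pointer also implies a small change to struct ntp_peer in ntpd.h,
presumably along these lines (a sketch of the implied change, not part
of the posted diff):

	 struct ntp_peer {
	 	/* ... */
	-	struct ntp_query	*query;
	+	struct ntp_query	 query;
	 	/* ... */
	 };

With the query embedded in struct ntp_peer, the single allocation made
for the peer covers both, and free(p) in peer_remove() releases
everything, so there is no separate allocation left to leak.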



Re: fix very small ntpd leak

2022-03-23 Thread Otto Moerbeek
On Wed, Mar 23, 2022 at 09:09:01PM +1000, Jonathan Matthew wrote:

> We noticed that the ntpd engine process was getting a bit big on some boxes
> that we'd accidentally cut off from the ntp servers (routing is hard).
> Reading through the code, I noticed the 'query' member of struct ntp_peer
> is never freed, which seems to account for the leak.
> 
> If you have a server pool in ntpd.conf and it resolves, but ntpd is unable
> to talk to the servers, it will re-resolve periodically, freeing the old list
> of peers and creating new ones.
> 
> To show how slow the leak is, here's the leak report from MALLOC_OPTIONS=D
> after running for about two hours with four servers from two pools.
> 
> without diff:
>  
> Leak report
>                  f     sum      #    avg
>                0x0    9392    128     73
>      0x889878b920b     512      1    512
>      0x889878bc8e1    4096      4   1024
>      0x889878bd065     128      2     64
>      0x88bc91f0b4b   18280      1  18280
>      0x88bc926a9ed   65536      1  65536
> 
> with diff:
> 
> Leak report
>                  f     sum      #    avg
>                0x0    6064     16    379
>      0xbee1253320b     512      1    512
>      0xbf0265f4b4b   18280      1  18280
>      0xbf02666e9ed   65536      1  65536
> 
> ok?
> 
> Index: ntp.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/ntpd/ntp.c,v
> retrieving revision 1.168
> diff -u -p -r1.168 ntp.c
> --- ntp.c	24 Oct 2021 21:24:19 -0000	1.168
> +++ ntp.c	23 Mar 2022 10:43:59 -0000
> @@ -686,6 +686,7 @@ void
>  peer_remove(struct ntp_peer *p)
>  {
> 	TAILQ_REMOVE(&conf->ntp_peers, p, entry);
> + free(p->query);
>   free(p);
>   peer_cnt--;
>  }
> 

This is a bug that dlg reported last week. Serendipity or not? :-)

This is my diff that uses an approach I like a little bit better.

-Otto

Index: client.c
===================================================================
RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
retrieving revision 1.116
diff -u -p -r1.116 client.c
--- client.c	21 Apr 2021 09:38:11 -0000	1.116
+++ client.c	21 Mar 2022 07:31:54 -0000
@@ -51,10 +51,9 @@ set_deadline(struct ntp_peer *p, time_t 
 int
 client_peer_init(struct ntp_peer *p)
 {
-   if ((p->query = calloc(1, sizeof(struct ntp_query))) == NULL)
-   fatal("client_peer_init calloc");
-   p->query->fd = -1;
-   p->query->msg.status = MODE_CLIENT | (NTP_VERSION << 3);
+   p->query.fd = -1;
+   p->query.msg.status = MODE_CLIENT | (NTP_VERSION << 3);
+   p->query.xmttime = 0;
p->state = STATE_NONE;
p->shift = 0;
p->trustlevel = TRUSTLEVEL_PATHETIC;
@@ -91,7 +90,7 @@ client_addr_init(struct ntp_peer *p)
}
}
 
-   p->query->fd = -1;
+   p->query.fd = -1;
set_next(p, 0);
 
return (0);
@@ -100,9 +99,9 @@ client_addr_init(struct ntp_peer *p)
 int
 client_nextaddr(struct ntp_peer *p)
 {
-   if (p->query->fd != -1) {
-   close(p->query->fd);
-   p->query->fd = -1;
+   if (p->query.fd != -1) {
+   close(p->query.fd);
+   p->query.fd = -1;
}
 
if (p->state == STATE_DNS_INPROGRESS)
@@ -148,26 +147,26 @@ client_query(struct ntp_peer *p)
if (p->state < STATE_DNS_DONE || p->addr == NULL)
return (-1);
 
-   if (p->query->fd == -1) {
+   if (p->query.fd == -1) {
		struct sockaddr *sa = (struct sockaddr *)&p->addr->ss;
		struct sockaddr *qa4 = (struct sockaddr *)&p->query_addr4;
		struct sockaddr *qa6 = (struct sockaddr *)&p->query_addr6;
 
-   if ((p->query->fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
+   if ((p->query.fd = socket(p->addr->ss.ss_family, SOCK_DGRAM,
0)) == -1)
fatal("client_query socket");
 
if (p->addr->ss.ss_family == qa4->sa_family) {
-   if (bind(p->query->fd, qa4, SA_LEN(qa4)) == -1)
+   if (bind(p->query.fd, qa4, SA_LEN(qa4)) == -1)
fatal("couldn't bind to IPv4 query address: %s",
log_sockaddr(qa4));
} else if (p->addr->ss.ss_family == qa6->sa_family) {
-   if (bind(p->query->fd, qa6, SA_LEN(qa6)) == -1)
+   if (bind(p->query.fd, qa6, SA_LEN(qa6)) == -1)
fatal("couldn't bind to IPv6 query address: %s",
log_sockaddr(qa6));
}
 
-   if (connect(p->query->fd, sa, SA_LEN(sa)) == -1) {
+   if (connect(p->query.fd, sa, SA_LEN(sa)) == -1) {
if (errno == ECONNREFUSED || errno == ENETUNREACH ||
errno == EHOSTUNREACH || errno == EADDRNOTAVAIL) {
/* cycle through 

Re: fix very small ntpd leak

2022-03-23 Thread Alexander Bluhm
On Wed, Mar 23, 2022 at 09:09:01PM +1000, Jonathan Matthew wrote:
> We noticed that the ntpd engine process was getting a bit big on some boxes
> that we'd accidentally cut off from the ntp servers (routing is hard).
> Reading through the code, I noticed the 'query' member of struct ntp_peer
> is never freed, which seems to account for the leak.
> 
> If you have a server pool in ntpd.conf and it resolves, but ntpd is unable
> to talk to the servers, it will re-resolve periodically, freeing the old list
> of peers and creating new ones.
> 
> To show how slow the leak is, here's the leak report from MALLOC_OPTIONS=D
> after running for about two hours with four servers from two pools.
> 
> without diff:
>  
> Leak report
>                  f     sum      #    avg
>                0x0    9392    128     73
>      0x889878b920b     512      1    512
>      0x889878bc8e1    4096      4   1024
>      0x889878bd065     128      2     64
>      0x88bc91f0b4b   18280      1  18280
>      0x88bc926a9ed   65536      1  65536
> 
> with diff:
> 
> Leak report
>                  f     sum      #    avg
>                0x0    6064     16    379
>      0xbee1253320b     512      1    512
>      0xbf0265f4b4b   18280      1  18280
>      0xbf02666e9ed   65536      1  65536
> 
> ok?

OK bluhm@

> Index: ntp.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/ntpd/ntp.c,v
> retrieving revision 1.168
> diff -u -p -r1.168 ntp.c
> --- ntp.c	24 Oct 2021 21:24:19 -0000	1.168
> +++ ntp.c	23 Mar 2022 10:43:59 -0000
> @@ -686,6 +686,7 @@ void
>  peer_remove(struct ntp_peer *p)
>  {
> 	TAILQ_REMOVE(&conf->ntp_peers, p, entry);
> + free(p->query);
>   free(p);
>   peer_cnt--;
>  }



fix very small ntpd leak

2022-03-23 Thread Jonathan Matthew
We noticed that the ntpd engine process was getting a bit big on some boxes
that we'd accidentally cut off from the ntp servers (routing is hard).
Reading through the code, I noticed the 'query' member of struct ntp_peer
is never freed, which seems to account for the leak.

If you have a server pool in ntpd.conf and it resolves, but ntpd is unable
to talk to the servers, it will re-resolve periodically, freeing the old list
of peers and creating new ones.
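
To make the ownership concrete, here is the shape of the problem as a
compilable paraphrase -- not the literal ntpd code: struct contents are
abbreviated, fatal() is replaced with an error return, and ntpd's
global conf is replaced with an explicit list argument:

	#include <sys/queue.h>
	#include <stdlib.h>

	struct ntp_query { int fd; /* ... */ };

	struct ntp_peer {
		TAILQ_ENTRY(ntp_peer)	 entry;
		struct ntp_query	*query;	/* allocated separately */
	};

	TAILQ_HEAD(peer_list, ntp_peer);

	/* client_peer_init() makes the separate allocation ... */
	int
	client_peer_init(struct ntp_peer *p)
	{
		if ((p->query = calloc(1, sizeof(struct ntp_query))) == NULL)
			return (-1);	/* ntpd calls fatal() here */
		p->query->fd = -1;
		return (0);
	}

	/* ... but the pre-fix peer_remove() only freed the peer itself,
	 * so every pool re-resolution that replaces the peer list leaked
	 * one struct ntp_query per removed peer. */
	void
	peer_remove(struct peer_list *peers, struct ntp_peer *p)
	{
		TAILQ_REMOVE(peers, p, entry);
		/* free(p->query) was missing here */
		free(p);
	}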

To show how slow the leak is, here's the leak report from MALLOC_OPTIONS=D
after running for about two hours with four servers from two pools.
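
(For anyone unfamiliar with malloc's leak reports: with MALLOC_OPTIONS=D,
malloc(3) records allocation sites and dumps a leak report when the
process exits; as of recent OpenBSD releases the report is emitted via
utrace(2), so capturing it looks something like
"env MALLOC_OPTIONS=D ktrace -t u ntpd -d" followed by kdump(1).  In
the reports below, "f" is the address of the allocating call site (0x0
when no caller was recorded), "sum" the total bytes outstanding for
that site, "#" the number of outstanding allocations, and "avg" the
average allocation size.)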

without diff:
 
Leak report
                 f     sum      #    avg
               0x0    9392    128     73
     0x889878b920b     512      1    512
     0x889878bc8e1    4096      4   1024
     0x889878bd065     128      2     64
     0x88bc91f0b4b   18280      1  18280
     0x88bc926a9ed   65536      1  65536

with diff:

Leak report
                 f     sum      #    avg
               0x0    6064     16    379
     0xbee1253320b     512      1    512
     0xbf0265f4b4b   18280      1  18280
     0xbf02666e9ed   65536      1  65536

ok?

Index: ntp.c
===================================================================
RCS file: /cvs/src/usr.sbin/ntpd/ntp.c,v
retrieving revision 1.168
diff -u -p -r1.168 ntp.c
--- ntp.c	24 Oct 2021 21:24:19 -0000	1.168
+++ ntp.c	23 Mar 2022 10:43:59 -0000
@@ -686,6 +686,7 @@ void
 peer_remove(struct ntp_peer *p)
 {
 	TAILQ_REMOVE(&conf->ntp_peers, p, entry);
+   free(p->query);
free(p);
peer_cnt--;
 }
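
A note on comparing the two reports above: the allocation-site
addresses differ between runs because of ASLR, so sites are best
matched by size and count rather than by address.  The meaningful
change is the f=0x0 line, which drops from 128 outstanding allocations
(9392 bytes) to 16 (6064 bytes) with the diff applied; the four
1024-byte and two 64-byte allocations from the first run are also gone.
The remaining sites (512, 18280 and 65536 bytes, one allocation each)
appear in both runs and look like ntpd's steady-state allocations
rather than leaks.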