Re: [Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-27 Thread Simon Kelley



On 25/11/2023 16:51, Petr Menšík wrote:
Yes, the problem is 3) has a condition we wait until it changes then 
retry. But for a lot (most?) of errors we lack any indication from the 
system it has changed.


For example insufficient memory or insufficient file descriptors. It may 
change, but unlike watching up and down interfaces, there is no hook 
which would retry listener creation. It fails once and then just maybe 
retries on explicit reload. That is why I think it is absolutely 
necessary to log any failure we pass somewhere, unless we know we will 
retry later.


You're right, the only error from bind() that should be ignored is 
EADDRNOTAVAIL. everything else should be a fatal error during startup or 
logged once  the daemon is running.


I've just pushed a patch to that effect.

Cheers,

Simon.



More below...

On 11/23/23 13:47, Simon Kelley wrote:
That's a good point, but I don't think there needs to be any non-fatal 
error logging. There are three situations during startup.


1) bind() succeeds.
2) bind fails for a reason which won't change - fatal error.
3) bind fails for a reason which may change - startup and wait until 
it does change and try again.


The canonical example of 3) is the one I gave before, 
--listen-address=1.2.3.4 but not local interface has address 1.2.3.4. 
The intention is that when a new interface comes up with address 
1.2.3.4 then a new socket will be created and bound. This is long 
after startup, so the only option if it fails then is to log the event.
Of course, this is very special case somehow well handled. I agree there 
is not much else to do. We could only make the error fatal, but I don't 
think that is desired.


If the only situation where we want to wait is the one above, then the 
solution to to make EADDRNOTAVAIL at startup the only one where we 
keep waiting, and all the others are fatal. I think when I originally 
wrote this I wasn't sure if that was the only non-fatal error which is 
why the code is as it is.


This is not a complete solution to your original problem of enforcing 
only one dnsmasq daemon process in any case. For example if you 
configure a single listen-address which doesn't exist on the machine, 
then you can start as many dnsmasq processes as you like and they'll 
all start up and be waiting for the interface with that address to be 
created. Once it is, all will try and bind it, and all but one will 
fail, but they'll all still exist. Managing daemon processes is really 
the job of sysvinit or systemd, but the authors of the bug seem to 
sant protection from just running the binary from the command line.


We at Fedora support only services managed by systemd. But even for 
that, it needs to get some feedback of failure. If the process 
terminates with non-zero status code, unit will be marked failed. We 
*need* that. Alternative might be support for libsystemd with notify 
socket, which would work with Type=notify services. Now it will report 
failed startup only with Type=forking. Later failure is logged only as a 
warning regardless of type of the error. I think we want unexpected 
error types to be logged as errors, especially for insufficient 
resources errors like ENOMEM. Or made them fatal. With systemd unit 
Restart=on-failure, it might be able to recover from memory leaks if 
such errors were fatal. Not sure we want that, might break a lot of 
deployments, but also fix some.




TLDR;

We either pick a set of errors which are Ok to continue 
(EADDRNOTAVAIL, what others?) and fail fatally at startup for all 
others, or we pick a set of errors to fail fatally at startup 
(EADDRINUSE, EACCESS, what others?) and continue for all others.



Cheers,

Simon.


I would say safer would be to fatal error everything except explicitly 
waived, for now just EADDRNOTAVAIL and EINTR? I think most of these 
errors means incomplete degraded service anyway, without reliable 
self-repair code present. If it had repeat timer with exponentially 
increasing time of retry (with some upper bound), then we might want it 
to start anyway. But I think it is safer to prevent half-initialized 
service. Systemd can provide autorecovery with smart settings. Do we 
have a way to specify I do not require TCP listening socket for DNS? It 
should be clearly discouraged, but for some kinds of tests it might be 
acceptable.


Cheers,
Petr




On 23/11/2023 11:13, Petr Menšík wrote:
To fix problem with multiple instances correctly refusing running on 
the same machine and namespaces, yes, it would be sufficient.


But I think part of the problem is hiding all problems during startup 
and not showing them at all, in any source. I think that is okay for 
EADDRNOTAVAIL to not be printed. But I think in other cases we want 
at least warning somewhere. This way you also get exact error message 
printed. For example selinux policy hardening may prevent your 
process to listen on port 53, even though it has NET_BIND_SERVICE.


With my modification it will print 

Re: [Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-25 Thread Petr Menšík
Yes, the problem is 3) has a condition we wait until it changes then 
retry. But for a lot (most?) of errors we lack any indication from the 
system it has changed.


For example insufficient memory or insufficient file descriptors. It may 
change, but unlike watching up and down interfaces, there is no hook 
which would retry listener creation. It fails once and then just maybe 
retries on explicit reload. That is why I think it is absolutely 
necessary to log any failure we pass somewhere, unless we know we will 
retry later.


More below...

On 11/23/23 13:47, Simon Kelley wrote:
That's a good point, but I don't think there needs to be any non-fatal 
error logging. There are three situations during startup.


1) bind() succeeds.
2) bind fails for a reason which won't change - fatal error.
3) bind fails for a reason which may change - startup and wait until 
it does change and try again.


The canonical example of 3) is the one I gave before, 
--listen-address=1.2.3.4 but not local interface has address 1.2.3.4. 
The intention is that when a new interface comes up with address 
1.2.3.4 then a new socket will be created and bound. This is long 
after startup, so the only option if it fails then is to log the event.
Of course, this is very special case somehow well handled. I agree there 
is not much else to do. We could only make the error fatal, but I don't 
think that is desired.


If the only situation where we want to wait is the one above, then the 
solution to to make EADDRNOTAVAIL at startup the only one where we 
keep waiting, and all the others are fatal. I think when I originally 
wrote this I wasn't sure if that was the only non-fatal error which is 
why the code is as it is.


This is not a complete solution to your original problem of enforcing 
only one dnsmasq daemon process in any case. For example if you 
configure a single listen-address which doesn't exist on the machine, 
then you can start as many dnsmasq processes as you like and they'll 
all start up and be waiting for the interface with that address to be 
created. Once it is, all will try and bind it, and all but one will 
fail, but they'll all still exist. Managing daemon processes is really 
the job of sysvinit or systemd, but the authors of the bug seem to 
sant protection from just running the binary from the command line.


We at Fedora support only services managed by systemd. But even for 
that, it needs to get some feedback of failure. If the process 
terminates with non-zero status code, unit will be marked failed. We 
*need* that. Alternative might be support for libsystemd with notify 
socket, which would work with Type=notify services. Now it will report 
failed startup only with Type=forking. Later failure is logged only as a 
warning regardless of type of the error. I think we want unexpected 
error types to be logged as errors, especially for insufficient 
resources errors like ENOMEM. Or made them fatal. With systemd unit 
Restart=on-failure, it might be able to recover from memory leaks if 
such errors were fatal. Not sure we want that, might break a lot of 
deployments, but also fix some.




TLDR;

We either pick a set of errors which are Ok to continue 
(EADDRNOTAVAIL, what others?) and fail fatally at startup for all 
others, or we pick a set of errors to fail fatally at startup 
(EADDRINUSE, EACCESS, what others?) and continue for all others.



Cheers,

Simon.


I would say safer would be to fatal error everything except explicitly 
waived, for now just EADDRNOTAVAIL and EINTR? I think most of these 
errors means incomplete degraded service anyway, without reliable 
self-repair code present. If it had repeat timer with exponentially 
increasing time of retry (with some upper bound), then we might want it 
to start anyway. But I think it is safer to prevent half-initialized 
service. Systemd can provide autorecovery with smart settings. Do we 
have a way to specify I do not require TCP listening socket for DNS? It 
should be clearly discouraged, but for some kinds of tests it might be 
acceptable.


Cheers,
Petr




On 23/11/2023 11:13, Petr Menšík wrote:
To fix problem with multiple instances correctly refusing running on 
the same machine and namespaces, yes, it would be sufficient.


But I think part of the problem is hiding all problems during startup 
and not showing them at all, in any source. I think that is okay for 
EADDRNOTAVAIL to not be printed. But I think in other cases we want 
at least warning somewhere. This way you also get exact error message 
printed. For example selinux policy hardening may prevent your 
process to listen on port 53, even though it has NET_BIND_SERVICE.


With my modification it will print errors for listeners used. Note 
10.1.2.3 is hidden at that phase. You would not know it without 
strace analysis. I expect there can be different errors, for example 
running out of file descriptors or memory. Hiding something 
non-standard happened during startup is quite a bad 

Re: [Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-23 Thread Simon Kelley
That's a good point, but I don't think there needs to be any non-fatal 
error logging. There are three situations during startup.


1) bind() succeeds.
2) bind fails for a reason which won't change - fatal error.
3) bind fails for a reason which may change - startup and wait until it 
does change and try again.


The canonical example of 3) is the one I gave before, 
--listen-address=1.2.3.4 but not local interface has address 1.2.3.4. 
The intention is that when a new interface comes up with address 1.2.3.4 
then a new socket will be created and bound. This is long after startup, 
so the only option if it fails then is to log the event.


If the only situation where we want to wait is the one above, then the 
solution to to make EADDRNOTAVAIL at startup the only one where we keep 
waiting, and all the others are fatal. I think when I originally wrote 
this I wasn't sure if that was the only non-fatal error which is why the 
code is as it is.


This is not a complete solution to your original problem of enforcing 
only one dnsmasq daemon process in any case. For example if you 
configure a single listen-address which doesn't exist on the machine, 
then you can start as many dnsmasq processes as you like and they'll all 
start up and be waiting for the interface with that address to be 
created. Once it is, all will try and bind it, and all but one will 
fail, but they'll all still exist. Managing daemon processes is really 
the job of sysvinit or systemd, but the authors of the bug seem to sant 
protection from just running the binary from the command line.


TLDR;

We either pick a set of errors which are Ok to continue (EADDRNOTAVAIL, 
what others?) and fail fatally at startup for all others, or we pick a 
set of errors to fail fatally at startup (EADDRINUSE, EACCESS, what 
others?) and continue for all others.



Cheers,

Simon.


On 23/11/2023 11:13, Petr Menšík wrote:
To fix problem with multiple instances correctly refusing running on the 
same machine and namespaces, yes, it would be sufficient.


But I think part of the problem is hiding all problems during startup 
and not showing them at all, in any source. I think that is okay for 
EADDRNOTAVAIL to not be printed. But I think in other cases we want at 
least warning somewhere. This way you also get exact error message 
printed. For example selinux policy hardening may prevent your process 
to listen on port 53, even though it has NET_BIND_SERVICE.


With my modification it will print errors for listeners used. Note 
10.1.2.3 is hidden at that phase. You would not know it without strace 
analysis. I expect there can be different errors, for example running 
out of file descriptors or memory. Hiding something non-standard 
happened during startup is quite a bad design. Only some kinds of errors 
are okay during startup.


$ sudo -u nobody fedora-like/dnsmasq -d --bind-dynamic 
--listen-address=10.1.2.3

dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied
dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied
dnsmasq: failed to create listening socket for ::1: Permission denied
dnsmasq: failed to create listening socket for ::1: Permission denied

dnsmasq: process is missing required capability NET_BIND_SERVICE

# Compare this with:

$ sudo -u nobody fedora-like/dnsmasq -d --bind-interfaces 
--listen-address=10.1.2.3

dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied

I think we want any errors printed, even if they are not made fatal. 
Except carefully chosen type of errors, which are expected and would 
raise just false alarms. Not sure how to trigger other types of errors, 
but I am sure I would like to see them, even if they did not cause the 
process to die. That is why I have used more complicated approach, which 
should print everything unexpected, even when dnsmasq is not stopped. In 
order to investigate you first have to know something unusual has happened.


On 23. 11. 23 0:29, Simon Kelley wrote:

Isn't this sufficient to fix the problem?

Not calling die() when bind-dynamic is set is intended to handle the 
case that  bind returns EADDRNOTAVAIL because you've configured 
--listen-address=1.2.3.4 but there's not a local interface with that 
address. dnsmasq runs anyway in the expectation that such an interface 
will appear in the future and a socket will be bound then.


I don't think there's a die()/syslog() conflict at all.


diff --git a/src/network.c b/src/network.c
index ca9fada..db1d528 100644
--- a/src/network.c
+++ b/src/network.c
@@ -924,7 +924,7 @@ static int make_sock(union mysockaddr *addr, int 
type, int dienow)

    {
  /* failure to bind addresses given by --listen-address at 
this point

 is OK if we're doing bind-dynamic */
- if (!option_bool(OPT_CLEVERBIND))
+ if (!option_bool(OPT_CLEVERBIND) || errno == EADDRINUSE)
    die(s, daemon->addrbuff, EC_BADNET);
    }
   else

Cheers,

Simon.

On 

Re: [Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-23 Thread Petr Menšík
To fix problem with multiple instances correctly refusing running on the 
same machine and namespaces, yes, it would be sufficient.


But I think part of the problem is hiding all problems during startup 
and not showing them at all, in any source. I think that is okay for 
EADDRNOTAVAIL to not be printed. But I think in other cases we want at 
least warning somewhere. This way you also get exact error message 
printed. For example selinux policy hardening may prevent your process 
to listen on port 53, even though it has NET_BIND_SERVICE.


With my modification it will print errors for listeners used. Note 
10.1.2.3 is hidden at that phase. You would not know it without strace 
analysis. I expect there can be different errors, for example running 
out of file descriptors or memory. Hiding something non-standard 
happened during startup is quite a bad design. Only some kinds of errors 
are okay during startup.


$ sudo -u nobody fedora-like/dnsmasq -d --bind-dynamic 
--listen-address=10.1.2.3

dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied
dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied
dnsmasq: failed to create listening socket for ::1: Permission denied
dnsmasq: failed to create listening socket for ::1: Permission denied

dnsmasq: process is missing required capability NET_BIND_SERVICE

# Compare this with:

$ sudo -u nobody fedora-like/dnsmasq -d --bind-interfaces 
--listen-address=10.1.2.3

dnsmasq: failed to create listening socket for 127.0.0.1: Permission denied

I think we want any errors printed, even if they are not made fatal. 
Except carefully chosen type of errors, which are expected and would 
raise just false alarms. Not sure how to trigger other types of errors, 
but I am sure I would like to see them, even if they did not cause the 
process to die. That is why I have used more complicated approach, which 
should print everything unexpected, even when dnsmasq is not stopped. In 
order to investigate you first have to know something unusual has happened.


On 23. 11. 23 0:29, Simon Kelley wrote:

Isn't this sufficient to fix the problem?

Not calling die() when bind-dynamic is set is intended to handle the 
case that  bind returns EADDRNOTAVAIL because you've configured 
--listen-address=1.2.3.4 but there's not a local interface with that 
address. dnsmasq runs anyway in the expectation that such an interface 
will appear in the future and a socket will be bound then.


I don't think there's a die()/syslog() conflict at all.


diff --git a/src/network.c b/src/network.c
index ca9fada..db1d528 100644
--- a/src/network.c
+++ b/src/network.c
@@ -924,7 +924,7 @@ static int make_sock(union mysockaddr *addr, int 
type, int dienow)

    {
  /* failure to bind addresses given by --listen-address at 
this point

 is OK if we're doing bind-dynamic */
- if (!option_bool(OPT_CLEVERBIND))
+ if (!option_bool(OPT_CLEVERBIND) || errno == EADDRINUSE)
    die(s, daemon->addrbuff, EC_BADNET);
    }
   else

Cheers,

Simon.

On 22/11/2023 19:27, Petr Menšík wrote:

Hello everyone,

I have received error report RHEL-16398 [1], which I think makes 
sense to fix even in the lastest version. I believe it allows 
non-intentional another instance running without error. What is 
worse, it does not even show any warning that initialization is 
incomplete.


Of course the problem at start is those errors happen in time when no 
log is available. I think that can be fixed easily by using stderr at 
that time. That is patch #1.


Second makes EADDRNOTAVAIL bind errors still hidden, but prints all 
other errors at least to stderr. On a system with systemd that should 
make it present in journalctl -u dnsmasq anyway. EADDRINUSE is made 
fatal, because that would not be usually handled by new addresses 
added later. If there is a need to start another dnsmasq instance 
without TCP listeners, I think that should be specified more 
explicitly. Makes EADDRINUSE fatal the same way as with 
--bind-interfaces.


Would you find any other errors, which should be hidden or made 
fatal? What would you think of those changes?


1. https://issues.redhat.com/browse/RHEL-16398


--
Petr Menšík
Software Engineer, RHEL
Red Hat, http://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB


___
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss


Re: [Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-22 Thread Simon Kelley

Isn't this sufficient to fix the problem?

Not calling die() when bind-dynamic is set is intended to handle the 
case that  bind returns EADDRNOTAVAIL because you've configured 
--listen-address=1.2.3.4 but there's not a local interface with that 
address. dnsmasq runs anyway in the expectation that such an interface 
will appear in the future and a socket will be bound then.


I don't think there's a die()/syslog() conflict at all.


diff --git a/src/network.c b/src/network.c
index ca9fada..db1d528 100644
--- a/src/network.c
+++ b/src/network.c
@@ -924,7 +924,7 @@ static int make_sock(union mysockaddr *addr, int 
type, int dienow)

{
  /* failure to bind addresses given by --listen-address at 
this point

 is OK if we're doing bind-dynamic */
- if (!option_bool(OPT_CLEVERBIND))
+ if (!option_bool(OPT_CLEVERBIND) || errno == EADDRINUSE)
die(s, daemon->addrbuff, EC_BADNET);
}
   else

Cheers,

Simon.

On 22/11/2023 19:27, Petr Menšík wrote:

Hello everyone,

I have received error report RHEL-16398 [1], which I think makes sense 
to fix even in the lastest version. I believe it allows non-intentional 
another instance running without error. What is worse, it does not even 
show any warning that initialization is incomplete.


Of course the problem at start is those errors happen in time when no 
log is available. I think that can be fixed easily by using stderr at 
that time. That is patch #1.


Second makes EADDRNOTAVAIL bind errors still hidden, but prints all 
other errors at least to stderr. On a system with systemd that should 
make it present in journalctl -u dnsmasq anyway. EADDRINUSE is made 
fatal, because that would not be usually handled by new addresses added 
later. If there is a need to start another dnsmasq instance without TCP 
listeners, I think that should be specified more explicitly. Makes 
EADDRINUSE fatal the same way as with --bind-interfaces.


Would you find any other errors, which should be hidden or made fatal? 
What would you think of those changes?


1. https://issues.redhat.com/browse/RHEL-16398


___
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss


___
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss


[Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode

2023-11-22 Thread Petr Menšík

Hello everyone,

I have received error report RHEL-16398 [1], which I think makes sense 
to fix even in the lastest version. I believe it allows non-intentional 
another instance running without error. What is worse, it does not even 
show any warning that initialization is incomplete.


Of course the problem at start is those errors happen in time when no 
log is available. I think that can be fixed easily by using stderr at 
that time. That is patch #1.


Second makes EADDRNOTAVAIL bind errors still hidden, but prints all 
other errors at least to stderr. On a system with systemd that should 
make it present in journalctl -u dnsmasq anyway. EADDRINUSE is made 
fatal, because that would not be usually handled by new addresses added 
later. If there is a need to start another dnsmasq instance without TCP 
listeners, I think that should be specified more explicitly. Makes 
EADDRINUSE fatal the same way as with --bind-interfaces.


Would you find any other errors, which should be hidden or made fatal? 
What would you think of those changes?


1. https://issues.redhat.com/browse/RHEL-16398

--
Petr Menšík
Software Engineer, RHEL
Red Hat, http://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB
From 207e9f4241c79b703320ae3568208e3a47cd25ef Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Petr=20Men=C5=A1=C3=ADk?= 
Date: Wed, 22 Nov 2023 20:04:14 +0100
Subject: [PATCH 2/2] Prevent starting another instance with --bind-dynamic

Previously startup bind() errors were silently dropped when starting with
--bind-dynamic. Make even in that mode EADDRINUSE error fatal to prevent
running another instance with half-initialized listeners.

On the other hand still hide address not available error, which is very
likely the reason for using bind-dynamic. Expect the address specified
will just appear later.
---
 src/network.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/network.c b/src/network.c
index ca9fada..f18be24 100644
--- a/src/network.c
+++ b/src/network.c
@@ -921,13 +921,9 @@ static int make_sock(union mysockaddr *addr, int type, int dienow)
   errno = errsave;
 
   if (dienow)
-	{
-	  /* failure to bind addresses given by --listen-address at this point
-	 is OK if we're doing bind-dynamic */
-	  if (!option_bool(OPT_CLEVERBIND))
-	die(s, daemon->addrbuff, EC_BADNET);
-	}
-  else
+	die(s, daemon->addrbuff, EC_BADNET);
+  else if (!option_bool(OPT_CLEVERBIND)
+	   || (option_bool(OPT_CLEVERBIND) && errno != EADDRNOTAVAIL))
 	my_syslog(LOG_WARNING, s, daemon->addrbuff, strerror(errno));
   
   return -1;
@@ -940,7 +936,14 @@ static int make_sock(union mysockaddr *addr, int type, int dienow)
 goto err;
   
   if ((rc = bind(fd, (struct sockaddr *)addr, sa_len(addr))) == -1)
-goto err;
+{
+  if (dienow && option_bool(OPT_CLEVERBIND) && errno != EADDRINUSE)
+	  /* failure to bind addresses given by --listen-address at this point
+	 is OK if we're doing bind-dynamic, except EADDRINUSE */
+	dienow = 0;
+
+  goto err;
+}
   
   if (type == SOCK_STREAM)
 {
-- 
2.42.0

From c1982e364c01a00c8b6454b677ae0dbe1ea4a382 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Petr=20Men=C5=A1=C3=ADk?= 
Date: Wed, 22 Nov 2023 20:00:01 +0100
Subject: [PATCH 1/2] Make stderr logging enabled until normal logging starts

Some kinds of errors like socket bind errors are done before dnsmasq
starts regular logging facility. Do not have those messages disappear,
but log them to stderr. As soon as log_start is called, that is resetted
according to configuration settings.
---
 src/log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/log.c b/src/log.c
index 77032fb..6edcc09 100644
--- a/src/log.c
+++ b/src/log.c
@@ -35,7 +35,7 @@
 /* defaults in case we die() before we log_start() */
 static int log_fac = LOG_DAEMON;
 static int log_stderr = 0;
-static int echo_stderr = 0;
+static int echo_stderr = 1;
 static int log_fd = -1;
 static int log_to_file = 0;
 static int entries_alloced = 0;
-- 
2.42.0

___
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss