On 7/20/21 7:55 AM, Willy Tarreau wrote:
Hi Peter,
first, thanks for bringing this here.
On Tue, Jul 20, 2021 at 01:13:58AM -0500, Peter Jin wrote:
1. The network namespace support seems to be a bit broken. In the function
"my_socketat" (lines 114-129 of src/namespace.c in the latest dev branch),
you first switch to the desired network namespace, create the socket, and
then switch back to the default namespace. While this is correct in
principle, it does not work in two cases, both involving user namespaces:
* First, HAProxy could be running in a non-initial user namespace with a
full set of capabilities (including CAP_SYS_ADMIN), while the network
namespace is still associated with the initial user namespace. Such an
environment can be simulated with "unshare -r" (omitting -n), or by using
a container runtime that supports user namespaces while keeping the host's
network namespace (e.g. docker run --net=host, if it were supported in
userns-remap mode). In this case, the first setns() would succeed, but the
setns() back to the original namespace would fail, because HAProxy would
not have CAP_SYS_ADMIN in the user namespace that owns the original
network namespace (the initial user namespace).
Interesting. However, this will return a socket error, so the problem
will be detected. In the case of a listener, if the user loses CAP_SYS_ADMIN,
this will fail during startup and be visible. I'm more concerned
about the risk of runtime failure when connecting into a particular namespace.
The real problem I'm seeing is that there is no boot-time check verifying
that things still work after returning from the temporary namespace.
At the very least we could try, during boot, to enter and leave each
server-side namespace and verify that this doesn't fail, otherwise
we'd quit.
You can probably detect this by calling the NS_GET_USERNS ioctl on the
original network namespace (i.e. /proc/self/ns/net); if it fails, the
owning user namespace is outside our namespace scope.
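For illustration, a minimal sketch of such a boot-time check (my own code,
not anything that exists in HAProxy today; requires Linux >= 4.9 for
NS_GET_USERNS) could look like this:

#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>        /* NS_GET_USERNS, Linux >= 4.9 */

/* Returns 1 if the user namespace owning our current netns is visible
 * (so a later setns() back into this netns has a chance to succeed),
 * 0 if that userns is outside our namespace scope, -1 on other errors. */
static int netns_return_possible(void)
{
    int ret = -1;
    int netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

    if (netns_fd < 0)
        return -1;

    int userns_fd = ioctl(netns_fd, NS_GET_USERNS);
    if (userns_fd >= 0) {
        close(userns_fd);
        ret = 1;
    }
    else if (errno == EPERM)
        ret = 0;   /* owning userns is an ancestor we cannot reach */

    close(netns_fd);
    return ret;
}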
To mitigate the first case, HAProxy would need to fork a new process, call
setns() and create the socket in the new process, and then transfer the
socket back to the original process using SCM_RIGHTS (you can probably
reuse the code in proto_sockpair.c, or some other file mentioning
SCM_RIGHTS, to do that).
Unfortunately that's really not workable; it would cause terrible latencies
that are in no way compatible with HTTP usage. It *might* work for listeners
but not for outgoing connections, and whatever solution we find for those
will work on both sides.
I wasn't really proposing connecting in a namespace at all, just
creating a listener (which appears to be what is supported right now).
Consequently, the ctrtool ns_open_file that I use for this purpose doesn't
even really support "client" sockets at all, just "server" ones.
A possible solution would be to use a "slow" socketat (as described above)
for the listeners, and keep the existing setns()-based approach as a "fast"
socketat for outgoing connections. Or child processes or threads could be
pre-created in the target network namespace; HAProxy would then "request"
a socket in that namespace from them and get it passed back over SCM_RIGHTS
(to simplify things, create IPv6 sockets only and use IPv4-mapped IPv6
addresses for IPv4 destinations; that way, the sockets are all homogeneous
and guesswork is eliminated).
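To sketch the SCM_RIGHTS hand-off between such a pre-created helper and the
main process (my own illustration; the function names and the assumption of
an already-established unix socketpair channel are mine):

#include <string.h>
#include <sys/socket.h>

/* Pass a single file descriptor over an AF_UNIX channel. One byte of
 * regular data is sent alongside because some systems refuse messages
 * that carry only ancillary data. */
static int send_fd(int chan, int fd)
{
    char data = 0;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd(); returns -1 on error. */
static int recv_fd(int chan)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    int fd = -1;

    if (recvmsg(chan, &msg, 0) <= 0)
        return -1;
    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg;
         cmsg = CMSG_NXTHDR(&msg, cmsg))
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

The helper would create the socket on request and send_fd() it back, and
the main process would recv_fd() it.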
* Second, HAProxy could be running as a non-root user, and at least one
"rootless" container with a separate network namespace exists for that
user. It would be nice if HAProxy could create a socket in such a network
namespace without root privileges. Judging by what I already see in the
code, that does not seem to be possible as it currently stands.
While we strongly encourage against running as root, I can indeed imagine
that it can be an issue with setns().
The solution for this case is identical to the first one; the only
difference is that you also have to enter the associated user namespace
first (hint: you can use the NS_GET_USERNS ioctl on the target network
namespace to obtain a file descriptor to its owning user namespace, which
you can pass to setns()) and set PR_SET_DUMPABLE to 0 before entering the
user namespace, for security.
These techniques have already been employed in software like "slirp4netns",
which creates a TUN/TAP device in a given network namespace, and handles
both of the above cases correctly. The only difference is that for HAProxy,
we should be creating a socket instead, but the overall technique is still
the same.
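Roughly, the sequence could look like the following sketch (my own
illustration, not code taken from slirp4netns; it assumes it runs in a
freshly forked, single-threaded child that then hands the socket back over
SCM_RIGHTS as described earlier):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/nsfs.h>        /* NS_GET_USERNS */

/* Create an IPv6 socket inside the netns referenced by netns_path
 * without requiring root: join the owning user namespace first (the
 * owner-UID rule grants the needed capabilities there), then the
 * network namespace itself, then create the socket. */
static int socket_in_netns(const char *netns_path)
{
    int sock = -1;
    int netns_fd = open(netns_path, O_RDONLY | O_CLOEXEC);

    if (netns_fd < 0)
        return -1;

    int userns_fd = ioctl(netns_fd, NS_GET_USERNS);

    /* stop being dumpable before taking on foreign namespaces */
    prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);

    if (userns_fd >= 0 &&
        setns(userns_fd, CLONE_NEWUSER) == 0 &&
        setns(netns_fd, CLONE_NEWNET) == 0)
        sock = socket(AF_INET6, SOCK_STREAM, 0);

    if (userns_fd >= 0)
        close(userns_fd);
    close(netns_fd);
    return sock;   /* to be handed back to the parent via SCM_RIGHTS */
}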
Yes, but while that's probably totally affordable in terms of latency
for the creation of a tunnel, it definitely is not for creating an outgoing
socket. We're speaking about tens to hundreds of microseconds in the very
best case, probably even more sometimes.
One intermediate possibility might be to drop everything but CAP_SYS_ADMIN,
but even then that leaves a lot of capabilities to an attacker :-/
I really think that the deepest problem is that there's still no
userland-friendly way to do socketat() without going through all that
mess :-/ The permissions ought to be checked once, and it should be
possible to just create a socket in the same namespace as another one
designated by an FD that has already passed the permission checks. Maybe
it's about time to revive the old discussions around that API, which
ended exactly 10 years ago :-/
Again, I'm not proposing to use the "slower" algorithm for outgoing
connections. I thought about doing something like that in my own
programs, but again, the latency would of course be a problem.
You could probably ask the Linux kernel guys to reintroduce socketat().
The original argument against it (namely, that switching to the new
namespace, creating the socket, and then switching back is an adequate
alternative) is no longer valid due to the existence of user namespaces,
as I mentioned above (they didn't exist in their current form when
socketat() was first proposed). I'm not sure how the necessary permissions
were handled in the original proposal (I don't really remember whether the
call could be used by an unprivileged user or whether it required
root/CAP_SYS_ADMIN), but if you do decide to push for reintroducing
socketat(), then maybe at least require CAP_SYS_ADMIN in the user namespace
that owns the network namespace, and rely on the owner-UID rule for user
namespaces. Or propose a new capability (probably "CAP_SETNS"?) for this
purpose along with it.
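Just to make the intent concrete, this is the kind of usage such a call
would enable (purely hypothetical, since socketat() never made it into
mainline; the namespace name is an arbitrary example):

/* Hypothetical -- socketat() does not exist in mainline Linux. The point
 * is that nsfd designates a netns whose access was checked once when it
 * was opened, and the kernel creates the socket there without the calling
 * thread ever switching namespaces. */
int nsfd = open("/var/run/netns/blue", O_RDONLY | O_CLOEXEC);
int fd   = socketat(nsfd, AF_INET6, SOCK_STREAM, 0);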
Another complaint about the network namespace support is that it only
supports namespaces in /var/run/netns. My own tool (search for "ctrtool
ns_open_file" on Google), on the other hand, supports network namespaces
created in arbitrary locations (and even allows creating sockets in
arbitrary namespaces in a way that accounts for the two user namespace
scenarios above). It would be nice if HAProxy supported arbitrary network
namespace locations too, to cover the rootless container use case.
I have no opinion on this and didn't even remember this path. I
agree that it would make sense to have a tunable in the global section
to change this! Are you interested in proposing a patch to do this?
Sure: if the "namespace" string contains a slash, interpret it as
a filename; otherwise interpret it as a pathname relative to
/var/run/netns. Alternatively, add a new option called something like
"netns-file" that specifies the pathname of the namespace file (the value
of the "namespace" argument would still be used in the PROXY protocol),
or "netns-fd" to specify an inherited file descriptor for the network
namespace.
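As a sketch of the slash-based rule (my own illustration of the idea, not
an actual patch against src/namespace.c):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* If the configured value contains a slash, treat it as a path to the
 * netns file; otherwise keep today's /var/run/netns/<name> behaviour. */
static int open_netns_by_name(const char *name)
{
    char path[PATH_MAX];

    if (strchr(name, '/'))
        snprintf(path, sizeof(path), "%s", name);
    else
        snprintf(path, sizeof(path), "/var/run/netns/%s", name);
    return open(path, O_RDONLY | O_CLOEXEC);
}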
Regarding how it is right now, it's true that mentions of network
namespaces in places other than /var/run/netns really don't seem to
exist on the Internet. Docker creates network namespaces in
/var/run/docker/netns, but the hex values might not be predictable. "ip
netns" doesn't really support them, but you could create one by running:
mkdir /run/whatever
touch /run/whatever/my_netns
unshare -n mount --bind /proc/self/ns/net /run/whatever/my_netns
2. There is a stack buffer overflow in one of the files. I'm not
disclosing it here because this email will end up on the public mailing
list. If there is a "security" email address I could disclose it to, what is
it?
As Lukas mentioned, it's secur...@haproxy.org, and indeed it's not prominent
enough.
Actually, there is no security issue (what I thought was a security
issue turned out not to be one). Forget about it; it's mentioned in the other
thread if you're curious. But please at least put the security email address
on the front page of haproxy.org.
3. There was another feature that I felt was really broken, but since it is
related to #2 (it's associated with the same file that the stack buffer
overflow exists in), I'm not disclosing it here publicly either. (The issue
itself has nothing to do with security, but I will only disclose it after
#2 has been resolved.)
No problem!
Since #2 was ultimately not a security issue, I'll give it here:
The feature that I thought was broken is the "sockpair@" method for a
frontend bind. Judging by tests/unit/test-sockpair.py, HAProxy expects to
inherit the two connected file descriptors of a socket pair of type
SOCK_STREAM. The socket type is validated in src/proto_sockpair.c (line
443), and it looks like anything other than SOCK_STREAM would be rejected.
However, my use case involves sending file descriptors returned by
accept() from an external daemon ("Socketbox", to be exact, also
presented on my website), and that daemon uses SOCK_DGRAM for its unix
sockets. Requiring the socket to be of type SOCK_STREAM would complicate
Socketbox's logic too much. I maintain a fork of stable 2.4 at
https://git2.peterjin.org/haproxy that drops this restriction so that
HAProxy is compatible with that daemon. Allowing SOCK_DGRAM/SOCK_SEQPACKET
shouldn't be an issue, since Unixes that don't support those types aren't
really affected.
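The relaxation I have in mind is roughly the following (an illustration of
the idea only, not the actual proto_sockpair.c code):

#include <sys/socket.h>

/* Accept any unix socket type able to carry SCM_RIGHTS instead of
 * insisting on SOCK_STREAM. */
static int sockpair_fd_type_ok(int fd)
{
    int type;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return 0;
    return type == SOCK_STREAM || type == SOCK_DGRAM ||
           type == SOCK_SEQPACKET;
}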
Furthermore, there are two other issues with it:
1. The source and destination IP addresses are not visible via %[src]
and %[dst]. It appears that proto_fam_sockpair.get_src/get_dst is NULL,
so they are simply ignored. My use case involves passing TCP sockets over
the unix socket, rather than creating unix connections, so I would need
that information to be visible. My fork, as mentioned above, changes this
(see the sketch after this list).
2. If the socket(pair) connection is broken, HAProxy spins at 100% CPU
usage on all of its threads. I haven't figured out what's wrong there, and
I haven't investigated further after the initial test, but since my use
case involves SOCK_DGRAM (which can never be "broken"), I haven't patched
it in that regard.
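Regarding the first issue, recovering the addresses of a passed TCP socket
only requires asking the kernel; a rough sketch of what a get_src/get_dst
could do (my illustration here, not the exact code from my fork):

#include <sys/socket.h>

/* For a TCP fd received over the socketpair, the kernel still knows both
 * endpoints, so plain getsockname()/getpeername() are enough. */
static int passed_fd_get_src(int fd, struct sockaddr *sa, socklen_t salen)
{
    socklen_t len = salen;
    return getpeername(fd, sa, &len);   /* remote peer   -> %[src] */
}

static int passed_fd_get_dst(int fd, struct sockaddr *sa, socklen_t salen)
{
    socklen_t len = salen;
    return getsockname(fd, sa, &len);   /* local address -> %[dst] */
}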
Again, all three issues might be very specific to how I'm using it, but
since "sockpair" is a much less commonly used feature (I haven't really
found anything on the Internet using it), that use case might not have
been anticipated. The documentation describes it as
like fd@ but you must use the fd of a
connected unix socket or of a socketpair. The bind waits
to receive a FD over the unix socket and uses it as if it
was the FD of an accept(). Should be used carefully.
and what I'm describing sounds like a perfect use case for it; however,
none of the three issues above (especially the first and second ones) are
obvious from that description. Since this feature would be extremely
useful for my use case (nothing else that I've found really supports
passing sockets in this way), it would be nice if these three issues could
be remedied.
Thanks!
Willy
Peter Jin