On 7/20/21 7:55 AM, Willy Tarreau wrote:
Hi Peter,
first, thanks for bringing this here.
On Tue, Jul 20, 2021 at 01:13:58AM -0500, Peter Jin wrote:
1. The network namespace support seems to be a bit broken. In the function
"my_socketat" (lines 114-129 of src/namespace.c in the latest dev branch),
you first switch to the desired network namespace, create the socket, and
then switch back to the default namespace. While this is correct in
principle, it does not work in two cases, both involving user namespaces:
* First, HAProxy could be running in a non-initial user namespace with a
full set of capabilities (including CAP_SYS_ADMIN), while the network
namespace is still associated with the initial user namespace. Such an
environment can be simulated with "unshare -r" (omitting -n), or by using
a container runtime that supports user namespaces while keeping the host's
network namespace (e.g. docker run --net=host, if it were supported in
userns-remap mode). In this case, the first setns() would succeed, but the
setns() back to the original namespace would fail, because HAProxy would
not have CAP_SYS_ADMIN in the user namespace that owns the original
network namespace (the initial user namespace).
Interesting. However, this will return a socket error, so the problem
will be detected. In the case of a listener, if the user loses CAP_SYS_ADMIN,
this will fail during startup and be visible. I'm more concerned
about the risk of runtime failure when connecting into a particular namespace.
The real problem I'm seeing is that there is no boot-time check verifying
that things still work after returning from the temporary namespace.
At the very least we could try, during boot, to enter and leave each
server-side namespace and verify that this doesn't fail, otherwise
we'd quit.
You can probably detect this by calling the NS_GET_USERNS ioctl on the
original network namespace (i.e. /proc/self/ns/net); if it fails, the
owning user namespace is outside our namespace scope.
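For illustration, a minimal sketch of such a boot-time check (my own code,
not anything that exists in HAProxy today; requires Linux >= 4.9 for
NS_GET_USERNS) could look like this:

#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>        /* NS_GET_USERNS, Linux >= 4.9 */

/* Returns 1 if the user namespace owning our current netns is visible
 * (so a later setns() back into this netns has a chance to succeed),
 * 0 if that userns is outside our namespace scope, -1 on other errors. */
static int netns_return_possible(void)
{
    int ret = -1;
    int netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

    if (netns_fd < 0)
        return -1;

    int userns_fd = ioctl(netns_fd, NS_GET_USERNS);
    if (userns_fd >= 0) {
        close(userns_fd);
        ret = 1;
    }
    else if (errno == EPERM)
        ret = 0;   /* owning userns is an ancestor we cannot reach */

    close(netns_fd);
    return ret;
}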
To mitigate the first case, HAProxy would need to fork a new process, call
setns() and create the socket in the new process, and then transfer the
socket back to the original process using SCM_RIGHTS (you can probably
reuse the code in proto_sockpair.c, or some other file mentioning
SCM_RIGHTS, to do that).
Unfortunately that's really not workable; it would cause terrible latencies
that are in no way compatible with HTTP usage. It *might* work for listeners
but not for outgoing connections, and whatever solution we find for those
will work on both sides.
I wasn't really proposing connecting in a namespace at all, just
creating a listener (which appears to be what is supported right now).
Consequently, the ctrtool ns_open_file that I use for this purpose doesn't
even really support "client" sockets at all, just "server" ones.
A possible solution would be to use a "slow" socketat (as described above)
for the listeners, and keep the existing setns()-based approach as a "fast"
socketat for outgoing connections. Or child processes or threads could be
pre-created in the target network namespace; HAProxy would then "request"
a socket in that namespace from them and get it passed back over SCM_RIGHTS
(to simplify things, create IPv6 sockets only and use IPv4-mapped IPv6
addresses for IPv4 destinations; that way, the sockets are all homogeneous
and guesswork is eliminated).
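To sketch the SCM_RIGHTS hand-off between such a pre-created helper and the
main process (my own illustration; the function names and the assumption of
an already-established unix socketpair channel are mine):

#include <string.h>
#include <sys/socket.h>

/* Pass a single file descriptor over an AF_UNIX channel. One byte of
 * regular data is sent alongside because some systems refuse messages
 * that carry only ancillary data. */
static int send_fd(int chan, int fd)
{
    char data = 0;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd(); returns -1 on error. */
static int recv_fd(int chan)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    int fd = -1;

    if (recvmsg(chan, &msg, 0) <= 0)
        return -1;
    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg;
         cmsg = CMSG_NXTHDR(&msg, cmsg))
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

The helper would create the socket on request and send_fd() it back, and
the main process would recv_fd() it.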
* Second, HAProxy could be running as a non-root user, and at least one
"rootless" container with a separate network namespace exists for that
user. It would be nice if HAProxy could create a socket in such a network
namespace without root privileges. Judging by what I already see in the
code, that does not seem to be possible as it currently stands.
While we strongly encourage against running as root, I can indeed imagine
that it can be an issue with setns().
The solution for this case is identical to the first one; the only
difference is that you also have to enter the associated user namespace
first (hint: you can use the NS_GET_USERNS ioctl on the target network
namespace to obtain a file descriptor to its owning user namespace, which
you can pass to setns()) and set PR_SET_DUMPABLE to 0 before entering the
user namespace, for security.
These techniques have already been employed in software like "slirp4netns",
which creates a TUN/TAP device in a given network namespace, and handles
both of the above cases correctly. The only difference is that for HAProxy,
we should be creating a socket instead, but the overall technique is still
the same.
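Roughly, the sequence could look like the following sketch (my own
illustration, not code taken from slirp4netns; it assumes it runs in a
freshly forked, single-threaded child that then hands the socket back over
SCM_RIGHTS as described earlier):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/nsfs.h>        /* NS_GET_USERNS */

/* Create an IPv6 socket inside the netns referenced by netns_path
 * without requiring root: join the owning user namespace first (the
 * owner-UID rule grants the needed capabilities there), then the
 * network namespace itself, then create the socket. */
static int socket_in_netns(const char *netns_path)
{
    int sock = -1;
    int netns_fd = open(netns_path, O_RDONLY | O_CLOEXEC);

    if (netns_fd < 0)
        return -1;

    int userns_fd = ioctl(netns_fd, NS_GET_USERNS);

    /* stop being dumpable before taking on foreign namespaces */
    prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);

    if (userns_fd >= 0 &&
        setns(userns_fd, CLONE_NEWUSER) == 0 &&
        setns(netns_fd, CLONE_NEWNET) == 0)
        sock = socket(AF_INET6, SOCK_STREAM, 0);

    if (userns_fd >= 0)
        close(userns_fd);
    close(netns_fd);
    return sock;   /* to be handed back to the parent via SCM_RIGHTS */
}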
Yes, but while that's probably totally affordable in terms of latency
for the creation of a tunnel, it definitely is not for creating an outgoing
socket. We're speaking about tens to hundreds of microseconds in the very
best case, probably even more sometimes.
One intermediate possibility might be to drop everything but CAP_SYS_ADMIN,
but even then that leaves a lot of capabilities to an attacker :-/
I really think that the deepest problem is that there's still no
userland-friendly way to do socketat() without going through all that
mess :-/ The permissions ought to be checked once, and it should be
possible to just create a socket in the same namespace as another one
designated by an FD that has already passed the permission checks. Maybe
it's about time to revive the old discussions around that API, which
ended exactly 10 years ago :-/
Again, I'm not proposing to use the "slower" algorithm for outgoing
connections. I thought about doing something like that in my own
programs, but again, the latency would of course be a problem.
You could probably ask the Linux kernel guys to reintroduce socketat().
The original argument against it (namely, that switching to the new
namespace, creating the socket, and then switching back is an adequate
alternative) is no longer valid due to the existence of user namespaces,
as I mentioned above (they didn't exist in their current form when
socketat() was first proposed). I'm not sure how the necessary permissions
were handled in the original proposal (I don't really remember whether the
call could be used by an unprivileged user or whether it required
root/CAP_SYS_ADMIN), but if you do decide to push for reintroducing
socketat(), then maybe at least require CAP_SYS_ADMIN in the user namespace
that owns the network namespace, and rely on the owner-UID rule for user
namespaces. Or propose a new capability (probably "CAP_SETNS"?) for this
purpose along with it.
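Just to make the intent concrete, this is the kind of usage such a call
would enable (purely hypothetical, since socketat() never made it into
mainline; the namespace name is an arbitrary example):

/* Hypothetical -- socketat() does not exist in mainline Linux. The point
 * is that nsfd designates a netns whose access was checked once when it
 * was opened, and the kernel creates the socket there without the calling
 * thread ever switching namespaces. */
int nsfd = open("/var/run/netns/blue", O_RDONLY | O_CLOEXEC);
int fd   = socketat(nsfd, AF_INET6, SOCK_STREAM, 0);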
Another complaint about the network namespace support is that it only
supports namespaces in /var/run/netns. My own tool (search for "ctrtool
ns_open_file" on Google), on the other hand, supports network namespaces
created in arbitrary locations (and even allows creating sockets in
arbitrary namespaces in a way that accounts for the two user namespace
scenarios above). It would be nice if HAProxy supported arbitrary network
namespace locations too, to cover the rootless container use case.
I have no opinion on this and didn't even remember this path. I
agree that it would make sense to have a tunable in the global section
to change this! Are you interested in proposing a patch to do this?
Sure: if the "namespace" string contains a slash, interpret it as
a filename; otherwise interpret it as a pathname relative to
/var/run/netns. Alternatively, add a new option called something like
"netns-file" that specifies the pathname of the namespace file (the value
of the "namespace" argument would still be used in the PROXY protocol),
or "netns-fd" to specify an inherited file descriptor for the network
namespace.
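As a sketch of the slash-based rule (my own illustration of the idea, not
an actual patch against src/namespace.c):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* If the configured value contains a slash, treat it as a path to the
 * netns file; otherwise keep today's /var/run/netns/<name> behaviour. */
static int open_netns_by_name(const char *name)
{
    char path[PATH_MAX];

    if (strchr(name, '/'))
        snprintf(path, sizeof(path), "%s", name);
    else
        snprintf(path, sizeof(path), "/var/run/netns/%s", name);
    return open(path, O_RDONLY | O_CLOEXEC);
}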
Regarding how it is right now, it's true that mentions of network
namespaces in places other than /var/run/netns really don't seem to
exist on the Internet. Docker creates network namespaces in
/var/run/docker/netns, but the hex values might not be predictable. "ip
netns" doesn't really support them, but you could create one by running:
mkdir /run/whatever
touch /run/whatever/my_netns
unshare -n mount --bind /proc/self/ns/net /run/whatever/my_netns
2. There is a stack buffer overflow in one of the files. I'm not
disclosing it here because this email will end up on the public mailing
list. If there is a "security" email address I could disclose it to, what is
it?
As Lukas mentioned, it's secur...@haproxy.org, and indeed it's not prominent
enough.
Actually, there is no security issue (what I thought was a security
issue turned out not to be one). Forget about it; it's mentioned in the other
thread if you're curious. But please at least put the security email address
on the front page of haproxy.org.
3. There was another feature that I felt was really broken, but since it is
related to #2 (it's associated with the same file that the stack buffer
overflow exists in), I'm not disclosing it here publicly either. (The issue
itself has nothing to do with security, but I will only disclose it after
#2 has been resolved.)
No problem!
Since #2 was ultimately not a security issue, I'll give it here:
The feature that I thought was broken is the "sockpair@" method for a
frontend bind. Judging by tests/unit/test-sockpair.py, HAProxy expects to
inherit the two connected file descriptors of a socket pair of type
SOCK_STREAM. The socket type is validated in src/proto_sockpair.c (line
443), and it looks like anything other than SOCK_STREAM would be rejected.
However, my use case involves sending file descriptors returned by
accept() from an external daemon ("Socketbox", to be exact, also
presented on my website), and that daemon uses SOCK_DGRAM for its unix
sockets. Requiring the socket to be of type SOCK_STREAM would complicate
Socketbox's logic too much. I maintain a fork of stable 2.4 at
https://git2.peterjin.org/haproxy that drops this restriction so that
HAProxy is compatible with that daemon. Allowing SOCK_DGRAM/SOCK_SEQPACKET
shouldn't be an issue, since Unixes that don't support those types aren't
really affected.
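The relaxation I have in mind is roughly the following (an illustration of
the idea only, not the actual proto_sockpair.c code):

#include <sys/socket.h>

/* Accept any unix socket type able to carry SCM_RIGHTS instead of
 * insisting on SOCK_STREAM. */
static int sockpair_fd_type_ok(int fd)
{
    int type;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return 0;
    return type == SOCK_STREAM || type == SOCK_DGRAM ||
           type == SOCK_SEQPACKET;
}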
Furthermore, there are two other issues with it:
1. The source and destination IP addresses are not visible via %[src]
and %[dst]. It appears that proto_fam_sockpair.get_src/get_dst is NULL,
so they are simply ignored. My use case involves passing TCP sockets over
the unix socket, rather than creating unix connections, so I would need
that information to be visible. My fork, as mentioned above, changes this
(see the sketch after this list).
2. If the socket(pair) connection is broken, HAProxy spins at 100% CPU
usage on all of its threads. I haven't figured out what's wrong there, and
I haven't investigated further after the initial test, but since my use
case involves SOCK_DGRAM (which can never be "broken"), I haven't patched
it in that regard.
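Regarding the first issue, recovering the addresses of a passed TCP socket
only requires asking the kernel; a rough sketch of what a get_src/get_dst
could do (my illustration here, not the exact code from my fork):

#include <sys/socket.h>

/* For a TCP fd received over the socketpair, the kernel still knows both
 * endpoints, so plain getsockname()/getpeername() are enough. */
static int passed_fd_get_src(int fd, struct sockaddr *sa, socklen_t salen)
{
    socklen_t len = salen;
    return getpeername(fd, sa, &len);   /* remote peer   -> %[src] */
}

static int passed_fd_get_dst(int fd, struct sockaddr *sa, socklen_t salen)
{
    socklen_t len = salen;
    return getsockname(fd, sa, &len);   /* local address -> %[dst] */
}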
Again, all three issues might be very specific to how I'm using it, but
since "sockpair" is a much less commonly used feature (I haven't really
found anything on the Internet using it), that use case might not have
been anticipated. The documentation describes it as
like fd@ but you must use the fd of a
connected unix socket or of a socketpair. The bind waits
to receive a FD over the unix socket and uses it as if it
was the FD of an accept(). Should be used carefully.
and what I'm describing sounds like a perfect use case for it; however,
none of the three issues above (especially the first and second ones) are
obvious from that description. Since this feature would be extremely
useful for my use case (nothing else that I've found really supports
passing sockets in this way), it would be nice if these three issues could
be remedied.
Thanks!
Willy
Peter Jin