Re: [PATCH] [RFC] Linux network namespace support for haproxy [v2]

Willy Tarreau Tue, 18 Nov 2014 02:38:10 -0800

Hi Krisztian,

On Mon, Nov 17, 2014 at 06:08:22PM +0100, KOVACS Krisztian wrote:
> Dear Willy & list,
> 
> We've finally had some time to spend on the namespace support previously
> discussed here on the mailing list.


Great, thanks for this!

> Please find attached the new version of the patch implementing namespace
> support and hopefully fixing some of your concerns. Changes compared to the
> previous version of the patch:
> 
> * namespaces have to be explicitly listed on the configuration file, and
> the file descriptors are opened when loading the configuration file and
> cached

This raises an interesting question for the list participants here : do
we want to have an explicit list to limit access to the namespaces or do
we want the namespaces list to be automatically set up based on use ?

I'm seeing pros and cons for each solution :

   - pros of explicit list : allows an admin to limit access to certain
     namespaces only, which is convenient in contexts where multiple
     configs are assembled together from various untrusted users.

   - cons of explicit list : requires manual addition to the list for
     each new namespace, which can be really unconvenient when dealing
     with APIs.

I tend to prefer an automatic list (ie: namespaces are automatically
registered as soon as they're referenced without having a separate
section for this), at least because it matches better what we do for
everything else (eg: we don't declare IP addresses nor ports in a
separate section). And it would avoid a double declaration. But there
could be very valid reasons for that list, I don't know.

> * the previous syntax for specifying that the namespace of the server
> connection should be taken from the namespace of the client connection has
> been changed from 'namespace fromproxy' to 'namespace *'

I think it's a good idea.

> * we've added some documentation describing the flow of network namespace
> information

That certainly helps :-)

I'm having a few comments inlined about the patch below :

diff --git a/include/common/namespace.h b/include/common/namespace.h
new file mode 100644
index 0000000..4457d6c
--- /dev/null
+++ b/include/common/namespace.h
@@ -0,0 +1,24 @@
+#ifndef NAMESPACE_H
+
+#include <stdlib.h>
+#include <eb32tree.h>
+
+struct netns_entry;
+int socketat(const struct netns_entry *ns, int domain, int type, int protocol);
+
+#ifdef CONFIG_HAP_NS
+
+struct netns_entry
+{
+       struct eb32_node node;
+       const char *name;
+       size_t name_len;
+       int fd;
+};

This one above should use an ebpt_node instead which carries the pointer
to the string, like this :

struct netns_entry
{
        struct ebpt_node node;
        int fd;
};

(more info on that below).

diff --git a/src/connection.c b/src/connection.c
index 3af6d9a..373fb2b 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -14,6 +14,7 @@
 
 #include <common/compat.h>
 #include <common/config.h>
+#include <common/namespace.h>
 
 #include <proto/connection.h>
 #include <proto/fd.h>
@@ -217,6 +218,14 @@ void conn_update_sock_polling(struct connection *c)
        c->flags = f;
 }
 
+/*
+ * Get data length from tlv
+ */
+static size_t get_tlv_length(const struct tlv *src)
+{
+       return (src->length_hi << 8) | src->length_lo;
+}
+
 /* This handshake handler waits a PROXY protocol header at the beginning of the
  * raw data stream. The header looks like this :
  *
@@ -245,6 +254,7 @@ int conn_recv_proxy(struct connection *conn, int flag)
        char *line, *end;
        struct proxy_hdr_v2 *hdr_v2;
        const char v2sig[] = PP2_SIGNATURE;
+       size_t tlv_length = 0;
 
        /* we might have been called just after an asynchronous shutr */
        if (conn->flags & CO_FL_SOCK_RD_SH)
@@ -431,6 +441,7 @@ int conn_recv_proxy(struct connection *conn, int flag)
                        ((struct sockaddr_in *)&conn->addr.to)->sin_addr.s_addr 
= hdr_v2->addr.ip4.dst_addr;
                        ((struct sockaddr_in *)&conn->addr.to)->sin_port = 
hdr_v2->addr.ip4.dst_port;
                        conn->flags |= CO_FL_ADDR_FROM_SET | CO_FL_ADDR_TO_SET;
+                       tlv_length = ntohs(hdr_v2->len) - PP2_ADDR_LEN_INET;
                        break;
                case 0x21:  /* TCPv6 */
                        ((struct sockaddr_in6 *)&conn->addr.from)->sin6_family 
= AF_INET6;
@@ -440,8 +451,34 @@ int conn_recv_proxy(struct connection *conn, int flag)
                        memcpy(&((struct sockaddr_in6 
*)&conn->addr.to)->sin6_addr, hdr_v2->addr.ip6.dst_addr, 16);
                        ((struct sockaddr_in6 *)&conn->addr.to)->sin6_port = 
hdr_v2->addr.ip6.dst_port;
                        conn->flags |= CO_FL_ADDR_FROM_SET | CO_FL_ADDR_TO_SET;
+                       tlv_length = ntohs(hdr_v2->len) - PP2_ADDR_LEN_INET6;
                        break;
                }
+
+               /* TLV parsing */
+               if (tlv_length) {
+                       size_t tlv_offset = trash.len - tlv_length;
+                       while (tlv_offset < trash.len) {
+                               const struct tlv *tlv_packet = (struct tlv *) 
&trash.str[tlv_offset];
+                               const size_t tlv_len = 
get_tlv_length(tlv_packet);
+                               tlv_offset += tlv_len + TLV_HEADER_SIZE;
+
+                               switch (tlv_packet->type) {
+#ifdef CONFIG_HAP_NS
+                               case PP2_TYPE_NETNS: {
+                                       const struct netns_entry *ns;
+                                       ns = 
netns_store_lookup((char*)tlv_packet->value, tlv_len);
+                                       if (ns)
+                                               conn->proxy_netns = ns;
+                                       break;
+                               }
+#endif

I think some bounds checkings are missing above : nothing prevents tlv_len
from being larger than trash.len - tlv_offset, which means that a large
PPv2 header with an NS TLV entry close to the end could cause
netns_store_lookup() to read beyond the trash buffer's end. At least that's
how I'm reading it, I could be wrong.

diff --git a/src/namespace.c b/src/namespace.c
new file mode 100644
index 0000000..5983aca
--- /dev/null
+++ b/src/namespace.c
@@ -0,0 +1,109 @@
+#define _GNU_SOURCE
+
+#include <common/namespace.h>
+#include <common/compiler.h>
+#include <common/hash.h>
+
+#include <sched.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/socket.h>
+
+#include <string.h>
+#ifdef CONFIG_HAP_NS
+
+/* Opens the namespace <ns_name> and returns the FD or -1 in case of error
+ * (check errno).
+ */
+static int open_named_namespace(const char *ns_name)
+{
+       char buf[512];
+       int fd;
+
+       snprintf(buf, sizeof(buf), "/var/run/netns/%s", ns_name);
+       fd = open(buf, O_RDONLY);
+
+       return fd;
+}

Here you need to test snprintf's return value otherwise we can open any crap
that remains in the stack if ns_name is too large. So both < 0 and >buf need
to return -1. BTW you don't even need to declare a local buffer for this, you
can use the trash buffer which is made exactly for that purpose, and use
chunk_printf() with it, it will synthesize the checks and return -1 in case
of error. That would give something like this :

static int open_named_namespace(const char *ns_name)
{
        if (chunk_printf(&trash, "/var/run/netns/%s", ns_name) < 0)
                return -1;
        return open(trash.str, O_RDONLY);
}

+/* Opens the current namespace and returns the FD or -1 in case of error
+ * (check errno). Caches its output so that once called it always returns
+ * the same fd.
+ *
+ * This is supposed to be called before calling any setns() syscalls so
+ * that it can open the file descriptor for the default namespace.
+ */
+static int get_default_namespace()
+{
+       static int fd = -1;
+
+       if (fd == -1) {

Please use "unlikely()" here since it will happen only at open time. We
could do even better, have this done in the constructor so that we never
have to call a function of this and directly reference the file descriptor.
That would also have the benefit of being thread-safe (thinking about the
future :-)).

+               char buf[512];
+               snprintf(buf, sizeof(buf), "/proc/%d/ns/net", getpid());
+               fd = open(buf, O_RDONLY);
+       }
+
+       return fd;
+}

Same comment as above, though in this case, there's no risk of overflow.

+static struct eb_root namespace_tree_root = EB_ROOT;
+
+struct netns_entry* netns_store_insert(const char *ns_name, size_t ns_name_len)
+{
+       struct netns_entry *entry = NULL;
+       int fd = open_named_namespace(ns_name);
+       if (fd == -1)
+               goto out;
+
+       entry = (struct netns_entry *)calloc(1, sizeof(struct netns_entry));

Maybe return NULL here if !entry ?

+       entry->fd = fd;
+       entry->name = strdup(ns_name);
+       entry->name_len = ns_name_len;
+       entry->node.key = hash_sdbm(entry->name, entry->name_len);
+
+       eb32_insert(&namespace_tree_root, &entry->node);
+out:
+       return entry;
+}

Please do not hash for inserting into the tree, it creates collisions and
needlessly complexifies the operations, better use an indirect string tree
which works with the ebpt_node instead, and get rid of the name_len which
is not used anymore :

        entry = (struct netns_entry *)calloc(1, sizeof(struct netns_entry));
        if (!entry)
                return entry;
        entry->fd = fd;
        entry->node.key = strdup(ns_name);
        ebis_insert(&namespace_tree_root, &entry->node);

+const struct netns_entry* netns_store_lookup(const char *ns_name, size_t 
ns_name_len)
+{
+       unsigned int hash;
+       struct eb32_node *node;
+
+       hash = hash_sdbm(ns_name, ns_name_len);
+       node = eb32_lookup(&namespace_tree_root, hash);
+       if (node)
+               return eb32_entry(node, struct netns_entry, node);
+       else
+               return NULL;
+}
+#endif

So here with the code simply becomes :

const struct netns_entry* netns_store_lookup(const char *ns_name)
{
        struct ebpt_node *node;

        node = ebis_lookup(&namespace_tree_root, ns_name);
        if (node)
                return ebpt_entry(node, struct netns_entry, node);
        else
                return NULL;
}


+/* Opens a socket in the namespace described by <ns> with the parameters 
<domain>,
+ * <type> and <protocol> and returns the FD or -1 in case of error (check 
errno).
+ */
+int socketat(const struct netns_entry *ns, int domain, int type, int protocol)

Just thinking about something, in Linux, many syscalls already exist with
the "at" suffix to indicate a variant working based on a file descriptor
pointing to a directory. While I'm not seeing any risk that "socketat"
would one day exist, I think we still have time to invent a less conflicting
name, what do you think ? Maybe something like ns_socket() or any other
idea ?

+{
+       int sock;
+       int current_namespace = get_default_namespace();
+
+#ifdef CONFIG_HAP_NS
+       if (ns && setns(ns->fd, CLONE_NEWNET) == -1)
+               return -1;
+#endif
+       sock = socket(domain, type, protocol);
+
+#ifdef CONFIG_HAP_NS
+       if (ns && setns(current_namespace, CLONE_NEWNET) == -1) {
+               close(sock);
+               return -1;
+       }
+#endif

Above I suspect that if you build without CONFIG_HAP_NS you'll get
a warning saying that current_namespace is not used. If you change
get_default_namespace() to just "static int default_namespace" which
is initialized by a constructor, this will simplify this part as well.

diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index cfa62f7..128b5f7 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -33,6 +33,7 @@
 #include <common/errors.h>
 #include <common/mini-clist.h>
 #include <common/standard.h>
+#include <common/namespace.h>
 
 #include <types/global.h>
 #include <types/capture.h>
@@ -247,6 +248,15 @@ int tcp_bind_socket(int fd, int flags, struct 
sockaddr_storage *local, struct so
        return 0;
 }
 
+static int create_server_socket(struct connection *conn)

This name should probably mention that it's a TCP socket ?

+{
+       const struct netns_entry *ns = objt_server(conn->target)->netns;
+
+       if (objt_server(conn->target)->flags & SRV_F_USE_NS_FROM_PP)
+               ns = conn->proxy_netns;
+
+       return socketat(ns, conn->addr.to.ss_family, SOCK_STREAM, IPPROTO_TCP);
+}
 
 /*
  * This function initiates a TCP connection establishment to the target 
assigned
@@ -2007,6 +2023,30 @@ static int bind_parse_interface(char **args, int 
cur_arg, struct proxy *px, stru
 }
 #endif
 

OK overall that's really very very good. After reading all the code, I'm
still thinking that we should have the namespaces automatically declared.
It would basically work like this : "bind" and "server" lines referencing
a namespace would call netns_find_or_init(nsname). This function would
lookup the requested name, and if not found, try to initialize it exactly
like it's currently done in the namespace section. Upon error, the error
would simply be returned in the context of the bind and server lines which
reference the namespace.

I'm just seeing one point, maybe you created them for the proxy protocol
only ? Indeed, if you receive a connection mentionning a namespace that
was not initialized and you want to connect to the server using that
namespace, it would not exist. Thus maybe we could make the namespaces
section optional to serve only that purpose. That means that most users
who use explicit namespaces everywhere will never need it, and those who
want to support extra namespaces will need to declare them in this section.

What's your opinion ?

Are you interested that I give you some help on this last-mile polishing
regarding the review above so that you don't waste your time looking for
some specific points ?

Thanks!
Willy

Re: [PATCH] [RFC] Linux network namespace support for haproxy [v2]

Reply via email to