Sean Hefty wrote:
The correct solution in my mind is to use the host stack's TCP port
space for _all_ RDMA_PS_TCP port allocations.   The patch below is a
minimal delta to unify the port spaces by using the kernel stack to
bind ports.  This is done by allocating a kernel socket and binding to
the appropriate local addr/port.  It also allows the kernel stack to
pick ephemeral ports by virtue of just passing in port 0 on the kernel
bind operation.

I'm not thrilled with the idea of overlapping port spaces, and I can't come up
with a solution that works for all situations.  I understand the overlapping
port space problem, but I consider the ability to use the same port number for
both RDMA and sockets a feature.

What if MPI used a similar mechanism as SDP?  That is, if it gets a port number
from sockets, it reserves that same RDMA port number, or vice-versa.  The
rdma_cm advertises separate port spaces from TCP/UDP, so IMO any assumption
otherwise, at this point, is a bug in the user's code.
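
Concretely, that "reserve the same port in both spaces" idea would look
roughly like the untested user-space sketch below, using the librdmacm
calls rdma_create_id() and rdma_bind_addr(); the function name and most
of the error handling are made up for illustration:

/* Untested sketch: get a port from the host TCP stack, then reserve the
 * same port number in the rdma_cm RDMA_PS_TCP port space.  Cleanup on
 * error paths is omitted for brevity. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

static int reserve_port_in_both_spaces(struct rdma_event_channel *ch,
                                       struct rdma_cm_id **cm_id,
                                       int *sock_fd)
{
        struct sockaddr_in addr;
        socklen_t len = sizeof addr;
        int fd, ret;

        /* Let the host TCP stack hand out an ephemeral port. */
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = 0;
        if (bind(fd, (struct sockaddr *) &addr, sizeof addr) ||
            getsockname(fd, (struct sockaddr *) &addr, &len))
                return -1;

        /* Reserve the same port number in the rdma_cm TCP port space. */
        ret = rdma_create_id(ch, cm_id, NULL, RDMA_PS_TCP);
        if (ret)
                return ret;
        ret = rdma_bind_addr(*cm_id, (struct sockaddr *) &addr);
        if (ret)
                return ret;

        *sock_fd = fd;  /* keep the socket open to hold the TCP port */
        return 0;
}

The socket has to stay open for as long as the rdma_cm id holds the
port; otherwise the host stack can hand the same port out again.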

This would require EVERY rdma application to do this, or it could potentially steal listening endpoints from sockets applications that are already running. This isn't just an MPI issue. Consider the iperf daemon running on port 5001. Then you run mvapich2/rdma and it happens to bind/listen on port 5001. Then an innocent incoming iperf connection gets passed to the rdma stack in error...


Before merging the port spaces, I'd like a way for an application to use a
single well-known port number that works over both RDMA and sockets.


So you want the ability for an application to support a standard TCP/Sockets service concurrently with an RDMA service? To do this effectively over iWARP, you need the ability to take a streaming native host connection and migrate it to rdma mode and vice-versa. And that model isn't very popular.

Consider NFS and NFS-RDMA. The NFS gurus struggled with this very issue and concluded that the RDMA service needs to be on a separate port. Thus they are proposing a new netid/port number for doing RDMA mounts vs TCP/UDP mounts. IMO that is the correct way to go: RDMA services are different than TCP services. They use a different protocol on top of TCP and thus shouldn't be handled on the same TCP port. So applications that want to offer sockets and RDMA services concurrently would do so by listening on different ports, something like the sketch below...
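
/* Untested sketch of the "separate ports" model: a normal sockets
 * listener on one well-known port and an rdma_cm listener on another.
 * The port numbers and function name are made up for illustration. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

#define SVC_TCP_PORT    5001    /* hypothetical sockets service port */
#define SVC_RDMA_PORT   5002    /* hypothetical RDMA service port */

static int listen_on_both(struct rdma_event_channel *ch,
                          struct rdma_cm_id **cm_id, int *tcp_fd)
{
        struct sockaddr_in addr;
        int fd, ret;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        /* Plain TCP service on its own port. */
        fd = socket(AF_INET, SOCK_STREAM, 0);
        addr.sin_port = htons(SVC_TCP_PORT);
        if (fd < 0 || bind(fd, (struct sockaddr *) &addr, sizeof addr) ||
            listen(fd, 128))
                return -1;

        /* RDMA service on a different port. */
        ret = rdma_create_id(ch, cm_id, NULL, RDMA_PS_TCP);
        if (ret)
                return ret;
        addr.sin_port = htons(SVC_RDMA_PORT);
        ret = rdma_bind_addr(*cm_id, (struct sockaddr *) &addr);
        if (!ret)
                ret = rdma_listen(*cm_id, 128);
        if (ret)
                return ret;

        *tcp_fd = fd;
        return 0;
}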


RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Is there any reason to limit this behavior to TCP only, or would we also include
UDP?


The iWARP protocols don't include a UDP-based service, so it isn't needed there. But if you're exposing a UDP port space, maybe it should be the host's UDP port space as well?

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
        struct rdma_cm_id       id;

        struct rdma_bind_list   *bind_list;
+       struct socket           *sock;

This points off to a rather largish structure...


Yes. The only exported interfaces into the host port allocation code require a socket struct. I didn't want to try and tackle exporting the port allocation services at a lower level. Even at the bottom level, I think it still assumes a socket struct...

        struct hlist_node       node;
        struct list_head        list;
        struct list_head        listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
                kfree(bind_list);
        }
        mutex_unlock(&lock);
+       if (id_priv->sock)
+               sock_release(id_priv->sock);
 }

 void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
        return 0;
 }

+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+       int ret;
+       struct socket *sock;
+
+       ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+       if (ret)
+               return ret;
+       ret = sock->ops->bind(sock,
+                         (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+                         ip_addr_size(&id_priv->id.route.addr.src_addr));
+       if (ret) {
+               sock_release(sock);
+               return ret;
+       }
+       id_priv->sock = sock;
+       return 0;
+}
+
 static int cma_get_port(struct rdma_id_private *id_priv)
 {
        struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
                break;
        case RDMA_PS_TCP:
                ps = &tcp_ps;
+               ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+               if (ret)
+                       goto out;

Would we need tcp_ps (and udp_ps) anymore?

I left it all in to show the minimal changes needed to implement the functionality, and to keep the patch simple for initial consumption. But yes, the rdma-cm really doesn't need to track the port stuff for TCP since the host stack does.
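
If the idr tracking were dropped for TCP, the RDMA_PS_TCP case in
cma_get_port() would reduce to roughly this (untested sketch, reusing
cma_get_tcp_port() from the patch):

        case RDMA_PS_TCP:
                /* the host stack owns the port; no tcp_ps idr bookkeeping */
                ret = cma_get_tcp_port(id_priv);
                break;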

Also, I think SDP maps into the TCP
port space already, so changes to SDP will be needed as well, which may
eliminate its port space.

I haven't looked in detail at the SDP code, but I would think it should want the TCP port space and not its own anyway, though I'm not sure. What is the point of the SDP port space anyway?

BTW: I believe for SDP to work over iWARP, a "shared by all SDP peers" port mapper service is required to map TCP port A to a different SDP port so that the iWARP SDP connections don't collide with real TCP connections for the same application service. Thus two ports are really used. Regardless, both ports would need to be reserved from the host stack's port space, or the same issue will happen: some sockets app might also bind to the SDP port and be denied service. I think it's all a non-issue for IB since IPoIB and RDMA/IB use different QPs/connections/contexts/etc...

One more issue for discussion:

This patch reserves ports exclusively even if they are just ephemeral ports used on the active side. This isn't really required, and it limits the number of connections you can have. The current rdma-cm, however, does the same thing. We could consider using lower-level services like inet_hash_connect().



Steve.
