[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
On Thu, 2009-10-22 at 15:02 -0700, Sean Hefty wrote: For ipv6 I ran what I described previously. What I do need to do is add the option to rping to specify a source address and run it with various address. Any help you can give defining what exactly needs to be tested would be appreciated. You can also test with ucmatose to verify ipv4 still works. Use the -b option to bind to a specific address. - Sean Sean This patch adds ipv6 support to ucmatose. Signed-off-by: David Wilder dwil...@us.ibm.com -- examples/cmatose.c | 31 +++ 1 files changed, 19 insertions(+), 12 deletions(-) diff --git a/examples/cmatose.c b/examples/cmatose.c index 8c12347..481a6d0 100644 --- a/examples/cmatose.c +++ b/examples/cmatose.c @@ -516,15 +516,15 @@ static int get_addr(char *dst, struct sockaddr_in *addr) return ret; } - if (res-ai_family != PF_INET) { + if (res-ai_family == PF_INET) + memcpy(addr, res-ai_addr, sizeof(struct sockaddr_in)); + else if (res-ai_family == PF_INET6) + memcpy(addr, res-ai_addr, sizeof(struct sockaddr_in6)); + else ret = -1; - goto out; - } - *addr = *(struct sockaddr_in *) res-ai_addr; -out: - freeaddrinfo(res); - return ret; +freeaddrinfo(res); +return ret; } static int run_server(void) @@ -543,11 +543,18 @@ static int run_server(void) ret = get_addr(src_addr, test.src_in); if (ret) goto out; - } else + if(test.src_in.sin_family == AF_INET) + ((struct sockaddr_in *) test.src_in)-sin_port = port; + else + ((struct sockaddr_in6 *) test.src_in)-sin6_port=port; + + } else { test.src_in.sin_family = PF_INET; + test.src_in.sin_port = port; + } + + ret = rdma_bind_addr(listen_id, (struct sockaddr *)test.src_in); - test.src_in.sin_port = port; - ret = rdma_bind_addr(listen_id, test.src_addr); if (ret) { perror(cmatose: bind address failed); goto out; @@ -628,8 +635,8 @@ static int run_client(void) printf(cmatose: connecting\n); for (i = 0; i connections; i++) { ret = rdma_resolve_addr(test.nodes[i].cma_id, - src_addr ? test.src_addr : NULL, - test.dst_addr, 2000); + src_addr ? (struct sockaddr *)test.src_in : NULL, + (struct sockaddr *)test.dst_in, 2000); if (ret) { perror(cmatose: failure getting addr); connect_error(); ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, 2009-10-21 at 17:08 -0600, Jason Gunthorpe wrote: This looks exactly like what I was thinking of - have you tested this? Yes I did do some testing, but that brings up a good question. I am not sure I know what all should be tested? I am running rping with different destination address (and scoping). On the ipv4 side: rping -c -a ip-of-my-ib0-interface rping -c -a ip-of-remote-nodes-ib0-interface For ipv6 I ran what I described previously. What I do need to do is add the option to rping to specify a source address and run it with various address. Any help you can give defining what exactly needs to be tested would be appreciated. If it is OK, I'd make it the first in the series. There were two things I was not sure about in my example. 1) Is 'init_net.loopback_dev' the correct reference for the loop device? Or is it something like dev_net(rt-idev-dev)-loopback_dev ? I'm sensing it may be the latter, but can't investigate right now Donno much about this new namespace stuff really I think you may be correct I will look at that closer. I did explicitly verify the test worked in both cases. 2) Was rt-idev-dev the right choice for the ipv4 case? Or is it rt-u.dst.dev ? The TCP case kinda looks like int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) tmp = ip_route_connect(rt, nexthop, inet-saddr, RT_CONN_FLAGS(sk), sk-sk_bound_dev_if, IPPROTO_TCP, inet-sport, usin-sin_port, sk, 1); sk_setup_caps(sk, rt-u.dst); void sk_setup_caps(struct sock *sk, struct dst_entry *dst) __sk_dst_set(sk, dst); And all later things key off the sk_get_dst. So I'm thinking that u.dst.dev might be correct. I have no idea what the difference is though (can't look too hard right now) The main other fixup I see is to remove ret = cma_bind_addr(id, src_addr, dst_addr); From rdma_resolve_addr and rely on the routing lookup in addr_resolve_remote called by addr_resolve_ip to setup the bind device from the routing lookup. (This is what I mentioned in my last email) Which then lets you fixup the checking and handling of the sin6_scopeid on the source address - and fixes the main other routing difference against the TCP stack. Thanks for working on this! Jason Lots of discussion :) I will go through the mails, address the comments and post the entire series of patches. Thanks for all your input. Dave. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
For ipv6 I ran what I described previously. What I do need to do is add the option to rping to specify a source address and run it with various address. Any help you can give defining what exactly needs to be tested would be appreciated. You can also test with ucmatose to verify ipv4 still works. Use the -b option to bind to a specific address. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 21, 2009 at 03:30:29PM -0700, David J. Wilder wrote: addr6_resolve_remote(). The second patch fixes the bugs in addr_resolve_local(). The 3d patch I am posting now for discussion. This patch, as Jason's suggested, moves the function of addr_resolve_local() into addr4_resolve_remote() and addr6_resolve_remote(). It eliminates the need for addr_resolve_local(). This looks exactly like what I was thinking of - have you tested this? If it is OK, I'd make it the first in the series. There were two things I was not sure about in my example. 1) Is 'init_net.loopback_dev' the correct reference for the loop device? Or is it something like dev_net(rt-idev-dev)-loopback_dev ? I'm sensing it may be the latter, but can't investigate right now Donno much about this new namespace stuff really 2) Was rt-idev-dev the right choice for the ipv4 case? Or is it rt-u.dst.dev ? The TCP case kinda looks like int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) tmp = ip_route_connect(rt, nexthop, inet-saddr, RT_CONN_FLAGS(sk), sk-sk_bound_dev_if, IPPROTO_TCP, inet-sport, usin-sin_port, sk, 1); sk_setup_caps(sk, rt-u.dst); void sk_setup_caps(struct sock *sk, struct dst_entry *dst) __sk_dst_set(sk, dst); And all later things key off the sk_get_dst. So I'm thinking that u.dst.dev might be correct. I have no idea what the difference is though (can't look too hard right now) The main other fixup I see is to remove ret = cma_bind_addr(id, src_addr, dst_addr); From rdma_resolve_addr and rely on the routing lookup in addr_resolve_remote called by addr_resolve_ip to setup the bind device from the routing lookup. (This is what I mentioned in my last email) Which then lets you fixup the checking and handling of the sin6_scopeid on the source address - and fixes the main other routing difference against the TCP stack. Thanks for working on this! Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
This patch, as Jason's suggested, moves the function of addr_resolve_local() into addr4_resolve_remote() and addr6_resolve_remote(). It eliminates the need for addr_resolve_local(). One quick comment, remove '_remote' from function names: addr4_resolve_remote, addr6_resolve_remote, and addr_resolve_remote drivers/infiniband/core/addr.c | 99 +++ 1 files changed, 18 insertions(+), 81 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index bd07803..f7a5861 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -233,6 +233,14 @@ static int addr4_resolve_remote(struct sockaddr_in *src_in, if (ret) goto out; + if (rt-idev-dev == init_net.loopback_dev){ + ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); + if (!ret) + memcpy(addr-dst_dev_addr, addr-src_dev_addr, + MAX_ADDR_LEN); + goto put; + } This doesn't end up doing the same thing as what resolve_local did. It only matches up with the 'else if' portion below: -static int addr_resolve_local(struct sockaddr *src_in, -struct sockaddr *dst_in, -struct rdma_dev_addr *addr) -{ - struct net_device *dev; - int ret; - - switch (dst_in-sa_family) { - case AF_INET: - { - __be32 src_ip = ((struct sockaddr_in *) src_in)-sin_addr.s_addr; - __be32 dst_ip = ((struct sockaddr_in *) dst_in)-sin_addr.s_addr; - - dev = ip_dev_find(init_net, dst_ip); - if (!dev) - return -EADDRNOTAVAIL; - - if (ipv4_is_zeronet(src_ip)) { - src_in-sa_family = dst_in-sa_family; - ((struct sockaddr_in *) src_in)-sin_addr.s_addr = dst_ip; - ret = rdma_copy_addr(addr, dev, dev-dev_addr); - } else if (ipv4_is_loopback(src_ip)) { - ret = rdma_translate_ip(dst_in, addr); - if (!ret) - memcpy(addr-dst_dev_addr, dev-dev_addr, MAX_ADDR_LEN); - } else { - ret = rdma_translate_ip(src_in, addr); - if (!ret) - memcpy(addr-dst_dev_addr, dev-dev_addr, MAX_ADDR_LEN); - } We need to handle the case where the source address is not given and provide one. We also need to handle the case where the source address is given, but may not match the destination address, or even use the same RDMA device. I didn't look at the ipv6 changes yet. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 21, 2009 at 04:17:01PM -0700, Sean Hefty wrote: This doesn't end up doing the same thing as what resolve_local did. It only matches up with the 'else if' portion below: It does all the behaviors, it passes the input into a routing lookup - though it is true the source needs to be passed into the routing lookup, which I think David had in another patch. Dave; I guess you need to include that here now. So, first case: -if (ipv4_is_zeronet(src_ip)) { Looks like: $ ip route get 10.0.0.11 local 10.0.0.11 dev lo src 10.0.0.11 cache local mtu 16436 advmss 16396 hoplimit 64 trips through the if == loopback and does 'rdma_translate_ip(10.0.0.11)' Same as the old code (little different path, but seems to give the same result) Second case: -} else if (ipv4_is_loopback(src_ip)) { Looks like: $ ip route get 10.0.0.11 from 127.0.0.1 local 10.0.0.11 from 127.0.0.1 dev lo cache local mtu 16436 advmss 16396 hoplimit 64 so trips through the if == loopback and does rdma_translate_ip(10.0.0.11) Same as the old code Third case: -} else { $ ip route get 10.0.0.11 from 10.0.0.11 local 10.0.0.11 from 10.0.0.11 dev lo cache local mtu 16436 advmss 16396 hoplimit 64 Again, does rdma_translate_ip(10.0.0.11) And the weird case is different: $ ip route get 10.0.0.11 from 192.168.122.1 local 10.0.0.11 from 192.168.122.1 dev lo cache local mtu 16436 advmss 16396 hoplimit 64 (192.168.122.1 is bound to a different device on my system than 10.0.0.11) The new case trips the if == loopback and does rdma_translate_ip(10.0.0.11) The old case does rdma_translate_ip(192.168.122.1) I don't think this is a significant difference, both behaviors are reasonable choices and the code/complexity savings are worth it, IMHO. [Alternatively, I suppose one could call rdma_translate_ip(rt-rt_src) and if that fails then do rdma_translate_ip(dst_in) but why bother?] Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 21, 2009 at 05:36:39PM -0600, Jason Gunthorpe wrote: - if (ipv4_is_zeronet(src_ip)) { Looks like: $ ip route get 10.0.0.11 local 10.0.0.11 dev lo src 10.0.0.11 cache local mtu 16436 advmss 16396 hoplimit 64 trips through the if == loopback and does 'rdma_translate_ip(10.0.0.11)' Same as the old code (little different path, but seems to give the same result) Oops, there is a little woopsie here: Dave: + if (rt-idev-dev == init_net.loopback_dev){ + ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); + if (!ret) + memcpy(addr-dst_dev_addr, addr-src_dev_addr, + MAX_ADDR_LEN); + goto put; + } + The 'goto put' will skip over the source address assignment step. Maybe move this: if (!src_ip) { src_in-sin_family = dst_in-sin_family; src_in-sin_addr.s_addr = rt-rt_src; } Up above your if (rt-idev) And similarly for v6. Also add a src_in = rt-rt_src assignment for v6. I'd also remove the test for 0 address, just do it unconditionally (IIRC routing table always returns src if src is not 0) Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
$ ip route get 10.0.0.11 from 192.168.122.1 local 10.0.0.11 from 192.168.122.1 dev lo cache local mtu 16436 advmss 16396 hoplimit 64 (192.168.122.1 is bound to a different device on my system than 10.0.0.11) The new case trips the if == loopback and does rdma_translate_ip(10.0.0.11) The old case does rdma_translate_ip(192.168.122.1) I don't think this is a significant difference, both behaviors are reasonable choices and the code/complexity savings are worth it, IMHO. [Alternatively, I suppose one could call rdma_translate_ip(rt-rt_src) and if that fails then do rdma_translate_ip(dst_in) but why bother?] I agree this usage case is weird, but I'm still not sure about the patch. If the destination service is listening on 10.0.0.11, then the listen is bound to a different gid than the source gid that we're trying to connect from. The src_dev_addr and dst_dev_addr need to reflect this, so that a PR gives us a path that routes the CM REQ correctly. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
The local loopback case uses PRs? Yes - the rdma cm makes no distinction when resolve route is called. It does a PR query. Even so, it still seems OK to me: Path: addr4_resolve_remote $ ip route get 10.0.0.11 from 192.168.122.1 local 10.0.0.11 from 192.168.122.1 dev lo srcIP = 192.168.122.1 rdma_translate_ip(dst_ip = 10.0.0.11) rdma_copy_addr(eth0); src_dev_addr = eth0.dev_addr (ie GID of 10.0.0.11) memcpy(dst_dev_addr = src_dev_addr) (ie GID of 10.0.0.11) So everthing is bound to the GID of 10.0.0.11 which matches the listen of 10.0.0.11, which seems OK. The source could have called rdma_bind_addr(192.168.122.1) prior to calling rdma_resolve_addr(). (DAPL does this.) This would have returned a different RDMA device than binding to 10.0.0.11. The client app could have allocated resources on that device, but the CM REQ will carry the gid/lid of the other device. The endpoints won't be able to communicate. Yes, it's weird, and may not be optimal, but if a source address is explicitly given, then its mapping to a specific RDMA device should be honored. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 21, 2009 at 05:40:30PM -0700, Sean Hefty wrote: Even so, it still seems OK to me: Path: addr4_resolve_remote $ ip route get 10.0.0.11 from 192.168.122.1 local 10.0.0.11 from 192.168.122.1 dev lo srcIP = 192.168.122.1 rdma_translate_ip(dst_ip = 10.0.0.11) rdma_copy_addr(eth0); src_dev_addr = eth0.dev_addr (ie GID of 10.0.0.11) memcpy(dst_dev_addr = src_dev_addr) (ie GID of 10.0.0.11) So everthing is bound to the GID of 10.0.0.11 which matches the listen of 10.0.0.11, which seems OK. The source could have called rdma_bind_addr(192.168.122.1) prior to calling rdma_resolve_addr(). (DAPL does this.) This would have returned a different RDMA device than binding to 10.0.0.11. The client app could have allocated resources on that device, but the CM REQ will carry the gid/lid of the other device. The endpoints won't be able to communicate. That is very difficult to fit into the semantics the IP routing model uses :( And it looks like an API problem in DAPL :( So, I see now, you are proposing that in this case the connection attempt to be routed through the network and not looped back.. I actually have a big problem with that, ignoring a 'lo' entry in a routing table is very much not IP like and not a good idea. That should be respected.. I guess I'd much rather see that one situation return EHOSTUNREACH or something. But, I suppose you are going to tell me that Intel MPI uses DAPL to loopback connect to other processes on the same node, and relies on this? :( :( :( Sigh. Anyhow, lets not get side tracked. It seems to me, the easy way out for David's approach is to simply check if the device is already bound via rdma_bind() and if so force it to that device no matter what the routing table lookup returns. Can you suggest a reliable way to make that check? [What happens now if I do this: rdma_bind(10.0.0.11) rdma_resolve_addr(src = 192.168.122.1 dst = 10.0.0.11) Does the cma_bind path check that it is already bound and give out an error? too late for me to check] Once the cma_bind for rdma_resolve_addr is moved into the addr_resolve_remote function then people using the API without calling bind on the client path will get sane IP-like behavior. Yes, it's weird, and may not be optimal, but if a source address is explicitly given, then its mapping to a specific RDMA device should be honored. Remember, on Linux the IP is *not* attached to a device, it is part of the host itself. So the idea that a source address somehow specifies a RDMA device does not fit into the Linux IP networking model. Unfortunately the definition of rdma_bind kinda bakes this mismatched model into the API :( Truth be told, to fit the Linux IP model, the RDMA CM should have provided exactly only two ways to bind a cm_id to a specific device - rdma_accept and rdma_resolve_addr. Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
That is very difficult to fit into the semantics the IP routing model uses :( And it looks like an API problem in DAPL :( This depends on if your view is that the rdma cm is trying to match IP routing, trying to use IP addresses as convenient names for RDMA ports, or something in between that may lean more one way or the other depending on the device type. IMO, the primary reason for using IP addressing over IB is more for convenience, than compliance. So, I see now, you are proposing that in this case the connection attempt to be routed through the network and not looped back.. I actually have a big problem with that, ignoring a 'lo' entry in a routing table is very much not IP like and not a good idea. That should be respected.. I hesitate comparing RDMA versus IP too closely. Sigh. Anyhow, lets not get side tracked. It seems to me, the easy way out for David's approach is to simply check if the device is already bound via rdma_bind() and if so force it to that device no matter what the routing table lookup returns. Can you suggest a reliable way to make that check? I'm not entirely sure of the best way to test this. Checking the src_dev_addr is my first thought. [What happens now if I do this: rdma_bind(10.0.0.11) rdma_resolve_addr(src = 192.168.122.1 dst = 10.0.0.11) Does the cma_bind path check that it is already bound and give out an error? too late for me to check] rdma_resolve_addr only calls bind if the rdma id is not already bound. The src_addr simply gets ignored in this case, and the bound address is used instead. Once the cma_bind for rdma_resolve_addr is moved into the addr_resolve_remote function then people using the API without calling bind on the client path will get sane IP-like behavior. I like the approach, I'm just concerned about the case where the app has requested a specific source address, which today means that a specific RDMA device should be used. Remember, on Linux the IP is *not* attached to a device, it is part of the host itself. So the idea that a source address somehow specifies a RDMA device does not fit into the Linux IP networking model. I would say we're extending the linux networking model to incorporate this concept, while staying in our imposed little RDMA sand box world. Even iWarp has these issues. Truth be told, to fit the Linux IP model, the RDMA CM should have provided exactly only two ways to bind a cm_id to a specific device - rdma_accept and rdma_resolve_addr. I think this is more restrictive than things need to be. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Mon, Oct 19, 2009 at 03:47:09PM -0700, David J. Wilder wrote: +++ b/drivers/infiniband/core/addr.c @@ -278,6 +278,21 @@ static int addr6_resolve_remote(struct sockaddr_in6 *src_in, fl.nl_u.ip6_u.daddr = dst_in-sin6_addr; fl.nl_u.ip6_u.saddr = src_in-sin6_addr; + if (ipv6_addr_type(src_in-sin6_addr) IPV6_ADDR_LINKLOCAL) { + if (!src_in-sin6_scope_id) + return -EINVAL; + fl.oif = src_in-sin6_scope_id; + } Seeing it all together like this make it clear this test needs to move up the call chain and test the sockaddr passed from userspace, not the one created by addr_resolve_local. Probably somewhere along the rdma_resolve_addr - cma_bind_addr - rmda_bind_addr - rdma_translate_ip path. Maybe rdma_translate_ip should use and check the scope as a temporary hack? BTW, while researching the above comment, I'm not certain your last patch is at all correct: commit 85f20b39fd44310a163a9b33708fea57f08a4e40 RDMA/addr: Fix resolution of local IPv6 addresses This patch allows a local IPv6 address to be resolved by rdma_cm. To reproduce the problem: $ rping -s -v -a ::0 $ rping -c -v -a IPv6 address local to this system --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -393,7 +393,7 @@ static int addr_resolve_local(struct sockaddr *src_in, for_each_netdev(init_net, dev) if (ipv6_chk_addr(init_net, - ((struct sockaddr_in6 *) addr)-sin6_addr, + ((struct sockaddr_in6 *) dst_in)-sin6_addr, dev, 1)) break; I can believe it fixes the case you describe (ie loopback) but matching the *dest* IP against the local interface's IP list cannot possibly be right. The primary problem is that for_each_netdev/ipv6_chk_addr is NOT the same as ip_dev_find. ip_dev_find is a routing lookup, ipv6_chk_addr compares the local address list. Not at all the same. I don't see a route lookup helper for ipv6, so you have to code full flowi lookup. With your change I expect ipv6 is 100% broken now for non loop cases? Regards, Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] RE: [PATCH] link-local address fix for rdma_resolve_addr
@@ -393,7 +393,7 @@ static int addr_resolve_local(struct sockaddr *src_in, for_each_netdev(init_net, dev) if (ipv6_chk_addr(init_net, - ((struct sockaddr_in6 *) addr)- sin6_addr, + ((struct sockaddr_in6 *) dst_in)- sin6_addr, dev, 1)) break; I can believe it fixes the case you describe (ie loopback) but matching the *dest* IP against the local interface's IP list cannot possibly be right. The intent is to see if the destination address is local. A source address may not be given. - Sean ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Mon, Oct 19, 2009 at 04:47:59PM -0700, Sean Hefty wrote: @@ -393,7 +393,7 @@ static int addr_resolve_local(struct sockaddr *src_in, for_each_netdev(init_net, dev) if (ipv6_chk_addr(init_net, - ((struct sockaddr_in6 *) addr)- sin6_addr, + ((struct sockaddr_in6 *) dst_in)- sin6_addr, dev, 1)) break; I can believe it fixes the case you describe (ie loopback) but matching the *dest* IP against the local interface's IP list cannot possibly be right. The intent is to see if the destination address is local. A source address may not be given. Well, that makes more sense, but it still pretty strange to match the IP list like that, the proper thing is to query RT6_TABLE_LOCAL, like the IPv4 case does. Anyhow, couldn't the whole addr_resolve_local routine be replaced with something like this in addr_resolve_remote: if (rt-idev == init_net-loopback_dev) rdma_translate_ip(rt-rt_src, dev_addr, NULL); for IPv4 and similar for IPv6? That does query the proper RT_TABLEs to determine if the IP is local and then we get the searching and ip_dev_find only for the case where the address is definitely looped back. Much closer to how the IP stack works normally. Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Thu, 2009-10-15 at 15:32 -0600, Jason Gunthorpe wrote: The actual fixing to the code is not hard, remove rdma_translate_ip, addr_resolve_local, split addr_resolve_remote into a part to resolve the route and a part that does the arp/nd. Make the route resolve part work almost exactly like addr4_resolve_remote (noting that the v6 version is wrong, since is doesn't respect unset source addres, another bug). Call rdma_copy_addr based on the rt-idev-dev (or should it be odev??). Do the ARP. The pain is in retesting everything :| Jason There is the wrinkle in this plan. If you pass ip6_route_output() your own link-local address (scoped or not) it returns a neighbor entry bound to the loop back device. $ ip route get fe80::202:c903:1:28ed oif ib0 local fe80::202:c903:1:28ed from :: via :: dev lo table local proto none src fe80::202:c903:1:28ed metric 0 mtu 16436 advmss 16376 hoplimit 4294967295 Thats not going to work as lo has no RDMA support. I suspect this is why addr_resolve_local() was added in the first place :) I will post a patch with our first solution and include the fix to addr_resolve_local(). Dave. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Fri, Oct 16, 2009 at 11:54:39AM -0700, David J. Wilder wrote: There is the wrinkle in this plan. If you pass ip6_route_output() your own link-local address (scoped or not) it returns a neighbor entry bound to the loop back device. Well, that really is correct. The RDMA CM could try to do something smart if lo is returned from the route lookup - ie choose the first RDMA device bound to that IP or something. There are other problems with the loop back case, like the Linux stack doesn't generate neighbor entries for its own IP addresses. The loopback case is far less important than the other policy routing features anyhow.. I will post a patch with our first solution and include the fix to addr_resolve_local(). May as well, a little bit of progress is better than none at all. Also note that I observed that the IPv6 code path never sets the source address from the routing table if IN6_ADDR_ANY is used as source, this is certainly wrong and could cause other problems down the line, probably more so for iwarp than IB though.. Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, 2009-10-14 at 13:09 Jason Gunthorpe wrote: So, it tries to match the source addr to the addrs bound to the device, which is wrong - that isn't how the ip stack works. You can patch this up a little bit by fixing up addr_resolve_local to set sin6_scope_ip. I found the bug in addr_resolve_local(). (more comments below) --- addr.c.1759 2009-10-13 15:57:48.0 -0500 +++ addr.c.ip_local 2009-10-15 14:03:50.0 -0500 @@ -390,14 +390,17 @@ static int addr_resolve_local(struct soc case AF_INET6: { struct in6_addr *a; + int found = 0; for_each_netdev(init_net, dev) if (ipv6_chk_addr(init_net, ((struct sockaddr_in6 *) dst_in)-sin6_addr, - dev, 1)) + dev, 1)){ + found = 1; break; + } - if (!dev) + if (!found) return -EADDRNOTAVAIL; a = ((struct sockaddr_in6 *) src_in)-sin6_addr; @@ -406,6 +409,8 @@ static int addr_resolve_local(struct soc src_in-sa_family = dst_in-sa_family; ((struct sockaddr_in6 *) src_in)-sin6_addr = ((struct sockaddr_in6 *) dst_in)-sin6_addr; + ((struct sockaddr_in6 *) src_in)-sin6_scope_id = +((struct sockaddr_in6 *) dst_in)-sin6_scope_id; ret = rdma_copy_addr(addr, dev, dev-dev_addr); } else if (ipv6_addr_loopback(a)) { ret = rdma_translate_ip(dst_in, addr); But really the correct thing to do is to remove addr_resolve_local and place the source address into the struct flowi and use the result of the route lookup to bind to the source device, and set the source address if it is unset. Sorry I don't get it.. Are you saying that ip6_route_output() will resolve the address even if it is a link-local address bound to my own interface? Therefor addr_resolve_local() is not needed. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Thu, Oct 15, 2009 at 12:27:21PM -0700, David J. Wilder wrote: On Wed, 2009-10-14 at 13:09 Jason Gunthorpe wrote: So, it tries to match the source addr to the addrs bound to the device, which is wrong - that isn't how the ip stack works. You can patch this up a little bit by fixing up addr_resolve_local to set sin6_scope_ip. I found the bug in addr_resolve_local(). (more comments below) Yes, that is the hacky work around I was mentioning.. But really the correct thing to do is to remove addr_resolve_local and place the source address into the struct flowi and use the result of the route lookup to bind to the source device, and set the source address if it is unset. Sorry I don't get it.. Are you saying that ip6_route_output() will resolve the address even if it is a link-local address bound to my own interface? Therefor addr_resolve_local() is not needed. Yes, and more. In Linux the routing table takes as input the source (optional), device (optional) and destination address and returns as output the device to use. To determine the device to bind to you ask the routing table what device to use for all the route information you have. For example: $ ip route get fe80::c2d from fe80::213:72ff:fe29:e65d oif eth0 fe80::c2d via fe80::c2d dev eth0 src fe80::213:72ff:fe29:e65d metric 0 cache mtu 1500 advmss 1440 hoplimit 4294967295 $ ip route get fe80::c2d oif eth0 fe80::c2d via fe80::c2d dev eth0 src fe80::213:72ff:fe29:e65d metric 0 cache mtu 1500 advmss 1440 hoplimit 4294967295 You can see in both cases the routing table returns a 'src' entry. 'src' is the address to bind to if no bind address was specified. When doing link local addresess the sin6_scope_id should sets the 'oif' key in the routing lookup, which will result in the correct src address and output device being selected by the routing algorithm. For instance on my machine here, I have two interfaces: $ ip route get fe80::c2d oif virbr0 fe80::c2d via fe80::c2d dev virbr0 src fe80::2c5d:c4ff:feb8:1ce5 metric 0 cache mtu 1500 advmss 1440 hoplimit 4294967295 As you can see it is returning the link local address for virbr0 as the source. So the algorithm in RDMA CM should look like this: - If src is specified then set the bind local address to src [if src is link local then it must specify sin6_scope_id, and sin6_scope_id becomes the oif input to the route lookup] - If dst is link local then its sin6_scope_id is the oif to the route lookup (and must match src, as we did last go round) - Src (or 'any'), dst and device (or 'any') are passed to the route lookup - The RDMA CM ID is bound to the device returned by the route lookup - If the src address was not specified then the connection source IP is set to the 'src' value from the route lookup. This is why addr_resolve_local/rdma_translate_ip is not needed, that entire entire function is done by the routing table code. You can see why this becomes important when it is combined with policy routing, for instance consider this example: $ ip rule 32765: from 10.0.0.4 lookup dnat $ ip route show table dnat default via 10.0.0.1 dev eth1 $ ip route get 10.0.0.100 10.0.0.100 dev eth0 src 10.0.0.2 $ ip route get 10.0.0.100 from 10.0.0.4 10.0.0.100 from 10.0.0.4 via 10.0.0.1 dev eth1 cache mtu 1500 advmss 1460 hoplimit 64 The two results are radically different and dependant on the source address. (10.0.0.4 could be attached to eth0, and eth1!) The actual fixing to the code is not hard, remove rdma_translate_ip, addr_resolve_local, split addr_resolve_remote into a part to resolve the route and a part that does the arp/nd. Make the route resolve part work almost exactly like addr4_resolve_remote (noting that the v6 version is wrong, since is doesn't respect unset source addres, another bug). Call rdma_copy_addr based on the rt-idev-dev (or should it be odev??). Do the ARP. The pain is in retesting everything :| Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
Sean- This patch should fix the behavior of rdma_resolve_addr when using link-local addressing. On Tue, 2009-10-13 at 17:12 -0600, Jason Gunthorpe wrote: On Tue, Oct 13, 2009 at 03:09:40PM -0700, David J. Wilder wrote: Here is a patch to addr6_resolve_remote() to correctly handle link-local address. It should cover all the conditions Jason described. Looks pretty good to me, definitely on the right track. Hmm.. Actually, upon comparing to tcp_ipv6.c, I'd say one more behavior is necessary. The code in tcp_ipv6 allows the destination to not specify a scope id if an interface has already been set. Looks like the two ways to set an interface ID are to use bind() or SO_BINDTODEVICE.. Specifying a source address to RDMA CM is similar to bind(), so if the source address is link local it must specify a sin6_scope_id and the dest address can specify 0, or the same value. Jason - This new patch should closely emulate tcp_ipv6.c. when both source and destination scope_ids are given with link-local address. With this patch rping works as expected. Link-local with scope: # /usr/bin/rping -c -a fe80::202:c903:1:1925%ib0 Link-local w/out scope: # /usr/bin/rping -c -a fe80::202:c903:1:1925 rdma_resolve_addr error -1 Other ipv6 address: # /usr/bin/rping -c -a 2001:db8:1234::2 (server side started with rping -s -P -v -a ::0) Signed-off-by: David Wilder dwil...@us.ibm.com - --- drivers/infiniband/core/addr.c.1759 2009-10-13 15:57:48.0 -0500 +++ drivers/infiniband/core/addr.c 2009-10-14 10:47:56.0 -0500 @@ -278,6 +278,21 @@ static int addr6_resolve_remote(struct s fl.nl_u.ip6_u.daddr = dst_in-sin6_addr; fl.nl_u.ip6_u.saddr = src_in-sin6_addr; + if (ipv6_addr_type(dst_in-sin6_addr) IPV6_ADDR_LINKLOCAL){ + // Link-local address require an interface to be specified. + if (!(dst_in-sin6_scope_id||src_in-sin6_scope_id)) + return -EINVAL; + + // If src and dst interfaces are supplied they must match. + if ( (dst_in-sin6_scope_id src_in-sin6_scope_id) + (src_in-sin6_scope_id != dst_in-sin6_scope_id) ) + return -EINVAL; + if ( dst_in-sin6_scope_id ) + fl.oif = dst_in-sin6_scope_id; + else + fl.oif = src_in-sin6_scope_id; + } + dst = ip6_route_output(init_net, NULL, fl); if (!dst) return ret; ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 14, 2009 at 09:23:57AM -0700, David J. Wilder wrote: This new patch should closely emulate tcp_ipv6.c. when both source and destination scope_ids are given with link-local address. Maybe like this: fl.oif = 0; if (ipv6_addr_type(src_in-sin6_addr) IPV6_ADDR_LINKLOCAL) { if (!src_in-sin6_scope_id) return -EINVAL; fl.oif = src_in-sin6scope_id; } if (ipv6_addr_type(dst_in-sin6_addr) IPV6_ADDR_LINKLOCAL){ if (dst_in.sin6_scope_id) { if (fl.oif fl.oif != dst_in.sin6_scope_id) return -EINVAL; fl.oif = dst_in.sin6_scope_id; } if (!fl.oif) return -EINVAL; } src and dest have to be tested seperately, the TCPv6 versions don't require them to be both link local. Don't forget to run checkpatch ;) Regards, ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, 2009-10-14 at 11:01 -0600, Jason Gunthorpe wrote: On Wed, Oct 14, 2009 at 09:23:57AM -0700, David J. Wilder wrote: This new patch should closely emulate tcp_ipv6.c. when both source and destination scope_ids are given with link-local address. Maybe like this: fl.oif = 0; if (ipv6_addr_type(src_in-sin6_addr) IPV6_ADDR_LINKLOCAL) { if (!src_in-sin6_scope_id) return -EINVAL; fl.oif = src_in-sin6scope_id; } if (ipv6_addr_type(dst_in-sin6_addr) IPV6_ADDR_LINKLOCAL){ if (dst_in.sin6_scope_id) { if (fl.oif fl.oif != dst_in.sin6_scope_id) return -EINVAL; fl.oif = dst_in.sin6_scope_id; } if (!fl.oif) return -EINVAL; } src and dest have to be tested seperately, the TCPv6 versions don't require them to be both link local. Don't forget to run checkpatch ;) This looks good. Once concern, it may be obtuse, but if both the src and dst are link-local addresses should only one need to be scoped? This patch will required the src to always be scoped when using link local. Dave... ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 14, 2009 at 10:30:05AM -0700, David J. Wilder wrote: This looks good. Once concern, it may be obtuse, but if both the src and dst are link-local addresses should only one need to be scoped? This patch will required the src to always be scoped when using link local. The TCPv6 code requires the src to be scoped (or SO_BINDTODEVICE to be used prior). If src is scoped then it is optional to scope dest, but if dest is scoped then it must match. I intended to capture that logic.. Can you give a quick test and send a git format-patch/checkpatch'd patch to Roland, with your signed off line, etc. You can put a Reviewed-by line from me as well if you like. Thanks, Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, 2009-10-14 at 11:01 -0600, Jason Gunthorpe wrote: On Wed, Oct 14, 2009 at 09:23:57AM -0700, David J. Wilder wrote: This new patch should closely emulate tcp_ipv6.c. when both source and destination scope_ids are given with link-local address. Maybe like this: fl.oif = 0; if (ipv6_addr_type(src_in-sin6_addr) IPV6_ADDR_LINKLOCAL) { if (!src_in-sin6_scope_id) return -EINVAL; fl.oif = src_in-sin6scope_id; } if (ipv6_addr_type(dst_in-sin6_addr) IPV6_ADDR_LINKLOCAL){ if (dst_in.sin6_scope_id) { if (fl.oif fl.oif != dst_in.sin6_scope_id) return -EINVAL; fl.oif = dst_in.sin6_scope_id; } if (!fl.oif) return -EINVAL; } src and dest have to be tested seperately, the TCPv6 versions don't require them to be both link local. Hum.. this change is not working with rping. The src address has been filled in already with a link-local address and no scope_id so it is returning -EINVAL. Another bug I guess I am doing some more debug. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Wed, Oct 14, 2009 at 12:33:17PM -0700, David J. Wilder wrote: On Wed, 2009-10-14 at 11:01 -0600, Jason Gunthorpe wrote: On Wed, Oct 14, 2009 at 09:23:57AM -0700, David J. Wilder wrote: This new patch should closely emulate tcp_ipv6.c. when both source and destination scope_ids are given with link-local address. Maybe like this: fl.oif = 0; if (ipv6_addr_type(src_in-sin6_addr) IPV6_ADDR_LINKLOCAL) { if (!src_in-sin6_scope_id) return -EINVAL; fl.oif = src_in-sin6scope_id; } if (ipv6_addr_type(dst_in-sin6_addr) IPV6_ADDR_LINKLOCAL){ if (dst_in.sin6_scope_id) { if (fl.oif fl.oif != dst_in.sin6_scope_id) return -EINVAL; fl.oif = dst_in.sin6_scope_id; } if (!fl.oif) return -EINVAL; } src and dest have to be tested seperately, the TCPv6 versions don't require them to be both link local. Hum.. this change is not working with rping. The src address has been filled in already with a link-local address and no scope_id so it is returning -EINVAL. Another bug I guess I am doing some more debug. Ah feh, you are running into that other bug.. http://lists.openfabrics.org/pipermail/general/2009-July/060612.html Basically it does this: int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) ret = rdma_resolve_ip(addr_client, (struct sockaddr *) id-route.addr.src_addr, dst_addr, id-route.addr.dev_addr, timeout_ms, addr_handler, id_priv); int rdma_resolve_ip(struct rdma_addr_client *client, struct sockaddr *src_addr, struct sockaddr *dst_addr, struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), void *context) req-status = addr_resolve_local(src_in, dst_in, addr); static int addr_resolve_local(struct sockaddr *src_in, struct sockaddr *dst_in, struct rdma_dev_addr *addr) dev = ip_dev_find(init_net, dst_ip); [..] for_each_netdev(init_net, dev) if (ipv6_chk_addr(init_net, ((struct sockaddr_in6 *) addr)-sin6_addr, dev, 1)) break; So, it tries to match the source addr to the addrs bound to the device, which is wrong - that isn't how the ip stack works. You can patch this up a little bit by fixing up addr_resolve_local to set sin6_scope_ip. But really the correct thing to do is to remove addr_resolve_local and place the source address into the struct flowi and use the result of the route lookup to bind to the source device, and set the source address if it is unset. Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
On Tue, Oct 13, 2009 at 03:09:40PM -0700, David J. Wilder wrote: Here is a patch to addr6_resolve_remote() to correctly handle link-local address. It should cover all the conditions Jason described. Looks pretty good to me, definitely on the right track. Hmm.. Actually, upon comparing to tcp_ipv6.c, I'd say one more behavior is necessary. The code in tcp_ipv6 allows the destination to not specify a scope id if an interface has already been set. Looks like the two ways to set an interface ID are to use bind() or SO_BINDTODEVICE.. Specifying a source address to RDMA CM is similar to bind(), so if the source address is link local it must specify a sin6_scope_id and the dest address can specify 0, or the same value. How af_inet6 handles bind(): if (addr_type IPV6_ADDR_LINKLOCAL) { if (addr_len = sizeof(struct sockaddr_in6) addr-sin6_scope_id) { /* Override any existing binding, if another one * is supplied by user. */ sk-sk_bound_dev_if = addr-sin6_scope_id; } /* Binding to link-local address requires an interface */ if (!sk-sk_bound_dev_if) { err = -EINVAL; goto out; } dev = dev_get_by_index(net, sk-sk_bound_dev_if); if (!dev) { err = -ENODEV; goto out; } } How tcp_ipv6 checks the destination address: if (addr_typeIPV6_ADDR_LINKLOCAL) { if (addr_len = sizeof(struct sockaddr_in6) usin-sin6_scope_id) { /* If interface is set while binding, indices * must coincide. */ if (sk-sk_bound_dev_if sk-sk_bound_dev_if != usin-sin6_scope_id) return -EINVAL; sk-sk_bound_dev_if = usin-sin6_scope_id; } /* Connect to link-local address requires an interface */ if (!sk-sk_bound_dev_if) return -EINVAL; } Jason ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg