Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
Adrian,

For the most part this seems to work for me, but there are a few issues. I'm not sure which are introduced by this patch, and whether some may be expected behavior, but for completeness I will point them all out. First, let me explain that I am working on a machine with 3 TCP interfaces: lo, eth0, and ib0. Both eth0 and ib0 connect all the compute nodes.

1. There are some warnings when compiling:

   btl_tcp_proc.c:171: warning: no previous prototype for 'evaluate_assignment'
   btl_tcp_proc.c:206: warning: no previous prototype for 'visit'
   btl_tcp_proc.c:224: warning: no previous prototype for 'mca_btl_tcp_initialise_interface'
   btl_tcp_proc.c: In function `mca_btl_tcp_proc_insert':
   btl_tcp_proc.c:304: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness
   btl_tcp_proc.c:313: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness
   btl_tcp_proc.c:389: warning: comparison between signed and unsigned
   btl_tcp_proc.c:400: warning: comparison between signed and unsigned
   btl_tcp_proc.c:401: warning: comparison between signed and unsigned
   btl_tcp_proc.c:459: warning: ISO C90 forbids variable-size array `a'
   btl_tcp_proc.c:459: warning: ISO C90 forbids mixed declarations and code
   btl_tcp_proc.c:465: warning: ISO C90 forbids mixed declarations and code
   btl_tcp_proc.c:466: warning: comparison between signed and unsigned
   btl_tcp_proc.c:480: warning: comparison between signed and unsigned
   btl_tcp_proc.c:485: warning: comparison between signed and unsigned
   btl_tcp_proc.c:495: warning: comparison between signed and unsigned

2. If I exclude all my TCP interfaces, the connection fails properly, but I do get a malloc request for 0 bytes:

   [tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude eth0,ib0,lo -np 2 ./ring_c
   malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
   malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)

3. If the exclude list does not contain 'lo', or the include list contains 'lo', the job hangs when using multiple nodes:

   [tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude ib0 -np 2 -bynode ./ring_c
   Process 0 sending 10 to 1, tag 201 (2 processes in ring)
   [odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection refused (111)

   [tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0,lo -np 2 -bynode ./ring_c
   Process 0 sending 10 to 1, tag 201 (2 processes in ring)
   [odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection refused (111)

However, the great news about this patch is that it appears to fix https://svn.open-mpi.org/trac/ompi/ticket/1027 for me.

Hope this helps,
Tim

Adrian Knoth wrote:

> On Wed, Jan 30, 2008 at 06:48:54PM +0100, Adrian Knoth wrote:
> > > What is the real issue behind this whole discussion?
> > Hanging connections.
> > I'll have a look at it tomorrow.
>
> To everybody who's interested in BTL-TCP, especially George and (to a
> minor degree) rhc: I've integrated something that I call "magic address
> selection code". See the comments in r17348. Can you check
>
> https://svn.open-mpi.org/svn/ompi/tmp-public/btl-tcp
>
> if it's working for you? Read: multi-rail TCP, FNN, whatever is
> important to you?
>
> The code is proof of concept and could use a little tuning (if it's
> working at all; over here, it satisfies all tests). I vaguely remember
> that at least Ralph doesn't like
>
> int a[perm_size * sizeof(int)];
>
> where perm_size is dynamically evaluated (read: the array size is
> runtime dependent). There are also some large arrays; search for
> MAX_KERNEL_INTERFACE_INDEX. Perhaps it's better to replace them with
> an appropriate OMPI data structure. I don't know what fits best; you
> guys know the details...
>
> So please give the code a try, and if it's working, feel free to clean
> up whatever is necessary to make it the OMPI style, or give me some
> pointers what to change.
>
> I'd like to point to Thomas' diploma thesis. The PDF explains the
> theory behind the code; it's like a rationale. Unfortunately, the PDF
> has some typos, but I guess you'll get the idea. It's a graph matching
> algorithm; Chapter 3 covers everything in detail:
>
> http://cluster.inf-ra.uni-jena.de/~adi/peiselt-thesis.pdf
>
> HTH
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
On Wed, Jan 30, 2008 at 06:48:54PM +0100, Adrian Knoth wrote:

> > What is the real issue behind this whole discussion?
> Hanging connections.
> I'll have a look at it tomorrow.

To everybody who's interested in BTL-TCP, especially George and (to a minor degree) rhc:

I've integrated something that I call "magic address selection code". See the comments in r17348. Can you check

https://svn.open-mpi.org/svn/ompi/tmp-public/btl-tcp

if it's working for you? Read: multi-rail TCP, FNN, whatever is important to you?

The code is proof of concept and could use a little tuning (if it's working at all; over here, it satisfies all tests). I vaguely remember that at least Ralph doesn't like

int a[perm_size * sizeof(int)];

where perm_size is dynamically evaluated (read: the array size is runtime dependent). There are also some large arrays; search for MAX_KERNEL_INTERFACE_INDEX. Perhaps it's better to replace them with an appropriate OMPI data structure. I don't know what fits best; you guys know the details...

So please give the code a try, and if it's working, feel free to clean up whatever is necessary to make it the OMPI style, or give me some pointers what to change.

I'd like to point to Thomas' diploma thesis. The PDF explains the theory behind the code; it's like a rationale. Unfortunately, the PDF has some typos, but I guess you'll get the idea. It's a graph matching algorithm; Chapter 3 covers everything in detail:

http://cluster.inf-ra.uni-jena.de/~adi/peiselt-thesis.pdf

HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
On Wed, Jan 30, 2008 at 03:38:00PM +0100, Bogdan Costescu wrote:

> The result is that, with the default Linux kernel settings, there is
> no way to tell which way a connection will take in a multi-rail TCP/IP
> setup. Even more, when the ARP cache expires and a new ARP request is
> made, the answer (MAC address) from the target/destination could be
> different, so that from that moment on the connection could switch to
> a different medium. I've tested this recently with the RHEL5 kernels
> with one gigabit and one Myri-10G connection, seeing a TCP stream
> switching randomly between the gigabit and the Myri-10G connection.

That's weird. I've never seen this, but given the various ARP settings in the Linux kernel, I could imagine such a scenario.

IPv6 doesn't use ARP, but Neighbor Discovery. It's completely different, and I hope it behaves link-local. It's a whole protocol (part of ICMPv6), so things might be better.

JFTR: http://www-uxsup.csx.cam.ac.uk/courses/ipv6_basics/x84.html

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
On Wed, Jan 30, 2008 at 12:05:50PM -0500, George Bosilca wrote:

> What is the real issue behind this whole discussion?

Hanging connections. See https://svn.open-mpi.org/trac/ompi/ticket/1206

The multi-address peer tries to connect, but btl_tcp_proc_accept denies the connection because the addresses don't match (there are fewer btl_endpoints than possible source addresses). r17331 and r17332 haven't fixed the issue. Don't code when leaving the office ;)

I'll have a look at it tomorrow. Sorry for all the noise in the trunk.

> multiple IP addresses by interface the connection step will work. Now
> I can see a benefit of having multiple sockets over the same link (and
> it's already implemented in Open MPI), but I don't see the interest of
> using multiple IPs in this case.

I have an easy-to-reproduce testcase for #1206. If you like, we can step through the debugger in a shared screen (screen -x) or VNC session. Just mail me if you're interested. ;)

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
What is the real issue behind this whole discussion? With one or multiple IP addresses per interface, the connection step will work. Now I can see a benefit of having multiple sockets over the same link (and it's already implemented in Open MPI), but I don't see the interest of using multiple IPs in this case.

  george.

On Jan 30, 2008, at 9:37 AM, Jeff Squyres wrote:

> Is one possible solution to have Open MPI mark in the packet where the
> incoming connection is coming from?
>
> On Jan 30, 2008, at 9:20 AM, Tim Mattox wrote:
>
> > Hello,
> >
> > On Jan 30, 2008 3:17 AM, Adrian Knoth wrote:
> > [snip]
> > > As mentioned earlier: it's very common to have multiple addresses
> > > per interface, and it's the kernel who assigns the source address,
> > > so there's nothing one could say about an incoming connection.
> > > Only that it could be any of all exported addresses. Any.
> >
> > This is only partially correct. Yes, by default the Linux kernel
> > will fill in the IP header with any of the IP addresses associated
> > with the machine, regardless of which NIC the packet will be sent
> > on. It was a never-ending debate on the Linux Kernel Mailing List as
> > to what was the right way to do things... are IP addresses "owned"
> > by the machine, or are they "owned" by the NIC? The kernel defaults
> > to the former definition (which is contrary to pretty much every
> > other OS on the planet... but the relevant RFCs left both
> > interpretations open).
> >
> > Anyway, there are ways to configure the networking stack of the
> > Linux kernel to get the other behavior, so that a packet will be
> > guaranteed to have one of the IP addresses associated with the NIC
> > that it uses for egress. See Documentation/networking/ip-sysctl.txt
> > in your Linux kernel sources for a description of the relevant
> > options: arp_filter, arp_announce, arp_ignore, which are accessed on
> > a live system here: /proc/sys/net/ipv4/conf/all/
> >
> > I guess if I put in the time, I could create a FAQ entry about it,
> > and what values to use... though I am not familiar with any
> > equivalent IPv6 settings (or if any exist).
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
On Wed, 30 Jan 2008, Adrian Knoth wrote:

> let me point out that the assumption "One interface, one address"
> isn't true.

Strictly speaking, when configuring one address per interface: for Linux, "one interface, one address" isn't true; for Solaris and other Unices it is. The standards allow either "assign addresses to interfaces" (the Solaris way) or "assign addresses to hosts" (the Linux way); this is a decision of the kernel network stack writers and can't be changed. However, there are ways to configure the Linux stack to behave similarly to the Solaris one in this respect, by limiting the ARP behaviour.

The decision to assign addresses to hosts in Linux was made so that there is a better chance of reaching the host in case of misconfiguration or network problems. Indeed, even if an interface is down (e.g. cable unplugged or 'ifconfig ethX down'), the address is reachable via other interfaces, as a new ARP association is made between the IP address and the MAC address of the interface which is used.

> As mentioned earlier: it's very common to have multiple addresses per
> interface

That's the other case: an interface can have several addresses configured for it, e.g. via repeated 'ip add ... dev ethX', but this just adds to the number of addresses assigned to the host.

The result is that, with the default Linux kernel settings, there is no way to tell which way a connection will take in a multi-rail TCP/IP setup. Even more, when the ARP cache expires and a new ARP request is made, the answer (MAC address) from the target/destination could be different, so that from that moment on the connection could switch to a different medium. I've tested this recently with the RHEL5 kernels with one gigabit and one Myri-10G connection, seeing a TCP stream switching randomly between the gigabit and the Myri-10G connection.
-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
Is one possible solution to have Open MPI mark in the packet where the incoming connection is coming from?

On Jan 30, 2008, at 9:20 AM, Tim Mattox wrote:

> Hello,
>
> On Jan 30, 2008 3:17 AM, Adrian Knoth wrote:
> [snip]
> > As mentioned earlier: it's very common to have multiple addresses
> > per interface, and it's the kernel who assigns the source address,
> > so there's nothing one could say about an incoming connection. Only
> > that it could be any of all exported addresses. Any.
>
> This is only partially correct. Yes, by default the Linux kernel will
> fill in the IP header with any of the IP addresses associated with the
> machine, regardless of which NIC the packet will be sent on. It was a
> never-ending debate on the Linux Kernel Mailing List as to what was
> the right way to do things... are IP addresses "owned" by the machine,
> or are they "owned" by the NIC? The kernel defaults to the former
> definition (which is contrary to pretty much every other OS on the
> planet... but the relevant RFCs left both interpretations open).
>
> Anyway, there are ways to configure the networking stack of the Linux
> kernel to get the other behavior, so that a packet will be guaranteed
> to have one of the IP addresses associated with the NIC that it uses
> for egress. See Documentation/networking/ip-sysctl.txt in your Linux
> kernel sources for a description of the relevant options: arp_filter,
> arp_announce, arp_ignore, which are accessed on a live system here:
> /proc/sys/net/ipv4/conf/all/
>
> I guess if I put in the time, I could create a FAQ entry about it, and
> what values to use... though I am not familiar with any equivalent
> IPv6 settings (or if any exist).
>
> -- 
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmat...@gmail.com || timat...@open-mpi.org
> I'm a bright... http://www.the-brights.net/

-- 
Jeff Squyres
Cisco Systems
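[Editor's note: the three sysctls Tim names can be applied as below. This is a sketch, assuming a Linux system and root privileges; the values shown are the commonly suggested "strict" settings, but whether they suit a given cluster is a policy decision, and the descriptions are paraphrased from ip-sysctl.txt.]

```shell
# Make ARP behave per-NIC rather than per-host, so a reply (and thus
# the MAC/IP association a peer learns) only involves the interface
# that actually owns the address:
#   arp_filter=1   - reply only if the kernel would route a packet to
#                    the requested IP out of the receiving interface
#   arp_announce=2 - use the best matching local address as the source
#                    of ARP announcements
#   arp_ignore=1   - reply only if the target IP is configured on the
#                    interface the request arrived on
sysctl -w net.ipv4.conf.all.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_announce=2
sysctl -w net.ipv4.conf.all.arp_ignore=1
```

The same keys exist per-interface under /proc/sys/net/ipv4/conf/ethX/ if only some rails should be constrained.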
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
On Tue, Jan 29, 2008 at 07:37:42PM -0500, George Bosilca wrote:

> The previous code was correct. Each IP address corresponds to a
> specific endpoint, and therefore to a specific BTL. This enables us to
> have multiple TCP BTLs at the same time, and allows the OB1 PML to
> stripe the data over all of them.
>
> Unfortunately, your commit disables multi-rail over TCP. Please
> undo it.

That's exactly what I had in mind when I said "this might break functionality".

So we need as many endpoints as IP addresses? Then simply connecting them leads to oversubscription: two parallel connections over the same medium. That's where the kernel index enters the scene: we'll have to make sure not to open two parallel connections to the same remote kernel index.

I'll revert the patch and come up with another solution, but for the moment, let me point out that the assumption "one interface, one address" isn't true. So the previous code was also wrong.

I hope not to run into model limitations: avoiding oversubscription means keeping the number of endpoints per peer lower than the number of the peer's interfaces, but accepting incoming connections from this peer means having all of its addresses (probably more than #remote_NICs) available in order to accept them.

As mentioned earlier: it's very common to have multiple addresses per interface, and it's the kernel who assigns the source address, so there's nothing one could say about an incoming connection. Only that it could be any of all exported addresses. Any.

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
The previous code was correct. Each IP address corresponds to a specific endpoint, and therefore to a specific BTL. This enables us to have multiple TCP BTLs at the same time, and allows the OB1 PML to stripe the data over all of them.

Unfortunately, your commit disables multi-rail over TCP. Please undo it.

  Thanks,
    george.

On Jan 29, 2008, at 10:55 AM, a...@osl.iu.edu wrote:

> Author: adi
> Date: 2008-01-29 10:55:56 EST (Tue, 29 Jan 2008)
> New Revision: 17307
> URL: https://svn.open-mpi.org/trac/ompi/changeset/17307
>
> Log:
> accept incoming connections from hosts with multiple addresses. We
> loop over all peer addresses and accept when one of them matches.
>
> Note that this might break functionality: mca_btl_tcp_proc_insert now
> always inserts the same endpoint. (is the lack of endpoints the
> problem? should there be one for every remote address?)
>
> Re #1206
>
> Text files modified:
>    trunk/ompi/mca/btl/tcp/btl_tcp_proc.c | 12 ++--
>    1 files changed, 6 insertions(+), 6 deletions(-)
>
> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_proc.c
> ==============================================================================
> --- trunk/ompi/mca/btl/tcp/btl_tcp_proc.c	(original)
> +++ trunk/ompi/mca/btl/tcp/btl_tcp_proc.c	2008-01-29 10:55:56 EST (Tue, 29 Jan 2008)
> @@ -327,16 +327,16 @@
>  {
>      size_t i;
>      OPAL_THREAD_LOCK(&btl_proc->proc_lock);
> -    for( i = 0; i < btl_proc->proc_endpoint_count; i++ ) {
> -        mca_btl_base_endpoint_t* btl_endpoint = btl_proc->proc_endpoints[i];
> +    for( i = 0; i < btl_proc->proc_addr_count; i++ ) {
> +        mca_btl_tcp_addr_t* exported_address = btl_proc->proc_addrs + i;
>          /* Check all conditions before going to try to accept the connection. */
> -        if( btl_endpoint->endpoint_addr->addr_family != addr->sa_family ) {
> +        if( exported_address->addr_family != addr->sa_family ) {
>              continue;
>          }
>          switch (addr->sa_family) {
>          case AF_INET:
> -            if( memcmp( &btl_endpoint->endpoint_addr->addr_inet,
> +            if( memcmp( &exported_address->addr_inet,
>                          &(((struct sockaddr_in*)addr)->sin_addr),
>                          sizeof(struct in_addr) ) ) {
>                  continue;
> @@ -344,7 +344,7 @@
>              break;
>  #if OPAL_WANT_IPV6
>          case AF_INET6:
> -            if( memcmp( &btl_endpoint->endpoint_addr->addr_inet,
> +            if( memcmp( &exported_address->addr_inet,
>                          &(((struct sockaddr_in6*)addr)->sin6_addr),
>                          sizeof(struct in6_addr) ) ) {
>                  continue;
> @@ -355,7 +355,7 @@
>              ;
>          }
> -        if(mca_btl_tcp_endpoint_accept(btl_endpoint, addr, sd)) {
> +        if(mca_btl_tcp_endpoint_accept(btl_proc->proc_endpoints[0], addr, sd)) {
>              OPAL_THREAD_UNLOCK(&btl_proc->proc_lock);
>              return true;
>          }

___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn