Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-02-01 Thread Tim Prins

Adrian,

For the most part this seems to work for me, but there are a few issues. 
I'm not sure which are introduced by this patch, and some may be 
expected behavior, but for completeness I will point them all out. 
First, let me explain that I am working on a machine with 3 TCP 
interfaces: lo, eth0, and ib0. Both eth0 and ib0 connect all the compute 
nodes.


1. There are some warnings when compiling:
btl_tcp_proc.c:171: warning: no previous prototype for 'evaluate_assignment'
btl_tcp_proc.c:206: warning: no previous prototype for 'visit'
btl_tcp_proc.c:224: warning: no previous prototype for 'mca_btl_tcp_initialise_interface'

btl_tcp_proc.c: In function `mca_btl_tcp_proc_insert':
btl_tcp_proc.c:304: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness
btl_tcp_proc.c:313: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness

btl_tcp_proc.c:389: warning: comparison between signed and unsigned
btl_tcp_proc.c:400: warning: comparison between signed and unsigned
btl_tcp_proc.c:401: warning: comparison between signed and unsigned
btl_tcp_proc.c:459: warning: ISO C90 forbids variable-size array `a'
btl_tcp_proc.c:459: warning: ISO C90 forbids mixed declarations and code
btl_tcp_proc.c:465: warning: ISO C90 forbids mixed declarations and code
btl_tcp_proc.c:466: warning: comparison between signed and unsigned
btl_tcp_proc.c:480: warning: comparison between signed and unsigned
btl_tcp_proc.c:485: warning: comparison between signed and unsigned
btl_tcp_proc.c:495: warning: comparison between signed and unsigned

2. If I exclude all my TCP interfaces, the connection fails as expected, 
but I do get a malloc request for 0 bytes:
[tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude eth0,ib0,lo -np 2 ./ring_c

malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)


3. If the exclude list does not contain 'lo', or the include list 
contains 'lo', the job hangs when using multiple nodes:
[tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude ib0 -np 2 -bynode ./ring_c

Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect] 
connect() failed: Connection refused (111)


[tprins@odin examples]$ mpirun -mca btl tcp,self  -mca 
btl_tcp_if_include eth0,lo -np 2 -bynode ./ring_c

Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect] 
connect() failed: Connection refused (111)



However, the great news about this patch is that it appears to fix 
https://svn.open-mpi.org/trac/ompi/ticket/1027 for me.


Hope this helps,

Tim








Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-31 Thread Adrian Knoth
On Wed, Jan 30, 2008 at 06:48:54PM +0100, Adrian Knoth wrote:

> > What is the real issue behind this whole discussion?
> Hanging connections.
> I'll have a look at it tomorrow.

To everybody who's interested in BTL-TCP, especially George and (to a
minor degree) rhc:

I've integrated something I call "magic address selection code".
See the comments in r17348.

Can you check

   https://svn.open-mpi.org/svn/ompi/tmp-public/btl-tcp

whether it works for you? Read: multi-rail TCP, FNN, whatever is
important to you.


The code is a proof of concept and could use a little tuning (if it
works at all; over here, it passes all tests).

I vaguely remember that at least Ralph doesn't like

   int a[perm_size * sizeof(int)];

where perm_size is evaluated at runtime (read: the array size is
runtime-dependent).

There are also some large arrays, search for MAX_KERNEL_INTERFACE_INDEX.
Perhaps it's better to replace them with an appropriate OMPI data
structure. I don't know what fits best, you guys know the details...


So please give the code a try, and if it works, feel free to clean up
whatever is necessary to match the OMPI style, or give me some pointers
on what to change.


I'd like to point to Thomas' diploma thesis. The PDF explains the theory
behind the code; it's like a rationale. Unfortunately, the PDF has some
typos, but I guess you'll get the idea. It's a graph-matching algorithm;
Chapter 3 covers everything in detail:

 http://cluster.inf-ra.uni-jena.de/~adi/peiselt-thesis.pdf


HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread Adrian Knoth
On Wed, Jan 30, 2008 at 03:38:00PM +0100, Bogdan Costescu wrote:

> The result is that, with the default Linux kernel settings, there is 
> no way to tell which path a connection will take in a multi-rail TCP/IP 
> setup. Even more, when the ARP cache expires and a new ARP request is 
> made, the answer (MAC address) from the target/destination could be 
> different, so that from that moment on the connection could switch to 
> a different medium. I've tested this recently with the RHEL5 kernels 
> with one gigabit and one Myri-10G connection, seeing a TCP stream 
> switching randomly between the gigabit and the Myri-10G connection.

That's weird. I've never seen this, but given the various ARP settings
in the Linux kernel, I could imagine such a scenario.

IPv6 doesn't use ARP but Neighbour Discovery. It's completely different,
and I hope it behaves "link local". It's a whole protocol (part of
ICMPv6), so things might be better.


JFTR: http://www-uxsup.csx.cam.ac.uk/courses/ipv6_basics/x84.html

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread Adrian Knoth
On Wed, Jan 30, 2008 at 12:05:50PM -0500, George Bosilca wrote:

> What is the real issue behind this whole discussion?

Hanging connections. See

   https://svn.open-mpi.org/trac/ompi/ticket/1206

The multi-address peer tries to connect, but btl_tcp_proc_accept denies
it because the addresses don't match (fewer btl_endpoints than possible
source addresses).

r17331 and r17332 haven't fixed the issue. Don't code when leaving the
office ;) I'll have a look at it tomorrow.

Sorry for all the noise in the trunk.

> With one or multiple IP addresses per interface, the connection step 
> will work. Now I can see a benefit of having multiple sockets over the 
> same link (and it's already implemented in Open MPI), but I don't see 
> the interest of using multiple IPs in this case.

I have an easy-to-reproduce test case for #1206. If you like, we can step
through the debugger in a shared screen (screen -x) or VNC session.

Just mail me if you're interested. ;)



-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread George Bosilca
What is the real issue behind this whole discussion? With one or
multiple IP addresses per interface, the connection step will work. Now
I can see a benefit of having multiple sockets over the same link (and
it's already implemented in Open MPI), but I don't see the interest of
using multiple IPs in this case.


  george.







Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread Bogdan Costescu

On Wed, 30 Jan 2008, Adrian Knoth wrote:

let me point out that the assumption "One interface, one address" 
isn't true.


Strictly speaking, when configuring one address per interface:

For Linux, "one interface, one address" isn't true; for Solaris and 
other Unices it is. The standards allow either "assign addresses to 
interfaces" (the Solaris way) or "assign addresses to hosts" (the Linux 
way); this is a decision of the kernel network-stack writers and can't 
be changed. However, there are ways to configure the Linux stack to 
behave similarly to the Solaris one from this point of view by limiting 
the ARP behaviour. The decision to assign addresses to hosts in Linux 
was made so that there is a better chance of reaching the host in case 
of misconfiguration or network problems. Indeed, even if an interface is 
down (e.g. cable unplugged or 'ifconfig ethX down'), the address is 
reachable via other interfaces, as a new ARP association is made between 
the IP address and the MAC address of the interface which is used.



As mentioned earlier: it's very common to have multiple addresses per
interface


That's the other case: an interface could have several addresses 
configured for it, f.e. via repeated 'ip add ... dev ethX', but this 
just adds to the number of addresses assigned to the host.


The result is that, with the default Linux kernel settings, there is 
no way to tell which path a connection will take in a multi-rail TCP/IP 
setup. Even more, when the ARP cache expires and a new ARP request is 
made, the answer (MAC address) from the target/destination could be 
different, so that from that moment on the connection could switch to 
a different medium. I've tested this recently with the RHEL5 kernels 
with one gigabit and one Myri-10G connection, seeing a TCP stream 
switching randomly between the gigabit and the Myri-10G connection.


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread Jeff Squyres
Is one possible solution to have Open MPI mark in the packet where the  
incoming connection is coming from?


On Jan 30, 2008, at 9:20 AM, Tim Mattox wrote:


Hello,

On Jan 30, 2008 3:17 AM, Adrian Knoth wrote:

[snip]

As mentioned earlier: it's very common to have multiple addresses per
interface, and it's the kernel that assigns the source address, so
there's nothing one could say about an incoming connection. Only that it
could be any of all exported addresses. Any.



This is only partially correct.  Yes, by default the Linux kernel will
fill in the IP header with any of the IP addresses associated with
the machine, regardless of which NIC the packet will be sent on.
It was a never ending debate on the Linux Kernel Mailing list as to
what was the right way to do things... are IP addresses "owned" by
the machine, or are they "owned" by the NIC?  The kernel defaults
to the former definition (which is contrary to pretty much every
other OS on the planet... but the relevant RFCs left both
interpretations open).  Anyway, there are ways to configure the
networking stack of
the Linux kernel to get the other behavior, so that a packet will be
guaranteed to have one of the IP addresses associated with the NIC
that it uses for egress.
See Documentation/networking/ip-sysctl.txt in your Linux kernel sources
for a description of these relevant options:
 arp_filter, arp_announce, arp_ignore
which are accessed on a live system here:
 /proc/sys/net/ipv4/conf/all/
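For reference, those options can also be set persistently via a sysctl fragment. This is a sketch of commonly used values following the ip-sysctl.txt descriptions; check them against your kernel's documentation before relying on them:

```
# Approximate the "addresses belong to NICs" behavior:
# answer ARP only on the interface that owns the address,
# and announce only addresses configured on the egress interface.
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
```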

I guess if I put in the time, I could create a FAQ entry about it,
and what values to use... though I am not familiar with any
equivalent IPv6 settings (or if any exist).
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-30 Thread Adrian Knoth
On Tue, Jan 29, 2008 at 07:37:42PM -0500, George Bosilca wrote:

> The previous code was correct. Each IP address corresponds to a
> specific endpoint, and therefore to a specific BTL. This enables us to
> have multiple TCP BTLs at the same time, and allows the OB1 PML to
> stripe the data over all of them.
>
> Unfortunately, your commit disables multi-rail over TCP. Please
> undo it.

That's exactly what I had in mind when I said "this might break
functionality".

So we need as many endpoints as IP addresses? Then simply connecting
them leads to oversubscription: two parallel connections over the same
medium. That's where the kernel interface index enters the scene: we'll
have to make sure not to open two parallel connections to the same
remote kernel index.

I'll revert the patch and come up with another solution, but for the
moment, let me point out that the assumption "One interface, one
address" isn't true. So, the previous code was also wrong.


I hope not to run into model limitations: avoiding oversubscription
means keeping the number of endpoints per peer no higher than the number
of its interfaces, but accepting incoming connections from that peer
means having all of its addresses (probably more than #remote_NICs)
available in order to accept them.

As mentioned earlier: it's very common to have multiple addresses per
interface, and it's the kernel that assigns the source address, so
there's nothing one could say about an incoming connection. Only that it
could be any of all exported addresses. Any.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307

2008-01-29 Thread George Bosilca
The previous code was correct. Each IP address corresponds to a
specific endpoint, and therefore to a specific BTL. This enables us to
have multiple TCP BTLs at the same time, and allows the OB1 PML to
stripe the data over all of them.


Unfortunately, your commit disables multi-rail over TCP. Please
undo it.


  Thanks,
george.

On Jan 29, 2008, at 10:55 AM, a...@osl.iu.edu wrote:


Author: adi
Date: 2008-01-29 10:55:56 EST (Tue, 29 Jan 2008)
New Revision: 17307
URL: https://svn.open-mpi.org/trac/ompi/changeset/17307

Log:
accept incoming connections from hosts with multiple addresses.

We loop over all peer addresses and accept when one of them matches.
Note that this might break functionality: mca_btl_tcp_proc_insert now
always inserts the same endpoint. (Is the lack of endpoints the problem?
Should there be one for every remote address?)

Re #1206


Text files modified:
  trunk/ompi/mca/btl/tcp/btl_tcp_proc.c |12 ++--
  1 files changed, 6 insertions(+), 6 deletions(-)

Modified: trunk/ompi/mca/btl/tcp/btl_tcp_proc.c
==============================================================================
--- trunk/ompi/mca/btl/tcp/btl_tcp_proc.c	(original)
+++ trunk/ompi/mca/btl/tcp/btl_tcp_proc.c	2008-01-29 10:55:56 EST (Tue, 29 Jan 2008)
@@ -327,16 +327,16 @@
 {
     size_t i;
     OPAL_THREAD_LOCK(&btl_proc->proc_lock);
-    for( i = 0; i < btl_proc->proc_endpoint_count; i++ ) {
-        mca_btl_base_endpoint_t* btl_endpoint = btl_proc->proc_endpoints[i];
+    for( i = 0; i < btl_proc->proc_addr_count; i++ ) {
+        mca_btl_tcp_addr_t* exported_address = btl_proc->proc_addrs + i;
         /* Check all conditions before going to try to accept the connection. */
-        if( btl_endpoint->endpoint_addr->addr_family != addr->sa_family ) {
+        if( exported_address->addr_family != addr->sa_family ) {
             continue;
         }
 
         switch (addr->sa_family) {
         case AF_INET:
-            if( memcmp( &btl_endpoint->endpoint_addr->addr_inet,
+            if( memcmp( &exported_address->addr_inet,
                         &(((struct sockaddr_in*)addr)->sin_addr),
                         sizeof(struct in_addr) ) ) {
                 continue;
@@ -344,7 +344,7 @@
             break;
 #if OPAL_WANT_IPV6
         case AF_INET6:
-            if( memcmp( &btl_endpoint->endpoint_addr->addr_inet,
+            if( memcmp( &exported_address->addr_inet,
                         &(((struct sockaddr_in6*)addr)->sin6_addr),
                         sizeof(struct in6_addr) ) ) {
                 continue;
@@ -355,7 +355,7 @@
             ;
         }
 
-        if(mca_btl_tcp_endpoint_accept(btl_endpoint, addr, sd)) {
+        if(mca_btl_tcp_endpoint_accept(btl_proc->proc_endpoints[0], addr, sd)) {
             OPAL_THREAD_UNLOCK(&btl_proc->proc_lock);
             return true;
         }
___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn



