I did not say we abort; I said we prevent BTL TCP from being used. In
your example, I guess the TCP BTL is disabled but the PML finds another
available interface and keeps going. If I try the same thing with
"--mca btl tcp,self" it does abort on my cluster.

---
mpirun -np 2 --mca btl tcp,self --mca btl_tcp_if_include eth3 ./ring_c
[dancer02][[48001,1],1][../../../../../ompi/ompi/mca/btl/tcp/btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
invalid interface "eth3"
[dancer01][[48001,1],0][../../../../../ompi/ompi/mca/btl/tcp/btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
invalid interface "eth3"
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[48001,1],0]) is on host: node01
  Process 2 ([[48001,1],1]) is on host: node02
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
---
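
To spell out why the run above aborts while yours keeps going, here is a
rough C sketch of the reachability check as I understand it. This is not
Open MPI code: the function names and the simplified rules ("self" reaches
only the calling process, "sm" only peers on the same host, anything else is
treated as a network BTL) are assumptions for illustration, and "openib"
merely stands in for whatever other BTL your run may have picked up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical model of what a BTL can reach:
 * "self" reaches only the calling process, "sm" reaches other processes
 * on the same host, and any other name is treated as a network BTL that
 * reaches any other process. */
bool btl_reaches(const char *btl, bool same_process, bool same_host)
{
    if (strcmp(btl, "self") == 0) {
        return same_process;
    }
    if (strcmp(btl, "sm") == 0) {
        return same_host && !same_process;
    }
    return !same_process;
}

/* True if at least one of the n usable BTLs can connect the pair. */
bool pair_reachable(const char *const *usable, size_t n,
                    bool same_process, bool same_host)
{
    assert(usable != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (btl_reaches(usable[i], same_process, same_host)) {
            return true;
        }
    }
    return false;
}
```

With "--mca btl tcp,self" and TCP disabled, the usable list is just {"self"},
so two processes on different hosts are unreachable and the job aborts. If
another network BTL is still in the list, the same pair stays reachable and
the job keeps going.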

The only reason I see for having if_seq in the first place is to
nicely balance the TCP traffic over multiple interfaces. As your patch
sets if_seq to NULL, it basically allows the TCP BTL to use __all__
available interfaces, which is exactly the opposite of what the user
asked for by specifying if_seq. As a result, the application
will execute over all available interfaces, and the result (especially in
terms of performance) might not be what the user expected. Very
confusing from my perspective.
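
To make the concern concrete, here is a minimal C sketch of the two policies
under discussion. It is not the code in btl_tcp_component.c. It assumes that
if_seq gives each process one interface chosen by node rank, wrapping around
the list; the function names, the policy enum, and the interface names in the
usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policies for an if_seq entry that names a missing interface. */
enum missing_policy {
    POLICY_DISABLE_BTL,   /* old behavior: error out, TCP BTL is not used */
    POLICY_IGNORE_IF_SEQ  /* patched behavior: act as if if_seq was never set */
};

/* Return 1 if name appears among the n_present interface names, else 0. */
int if_exists(const char *name, const char *const *present, size_t n_present)
{
    for (size_t i = 0; i < n_present; i++) {
        if (strcmp(name, present[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Number of interfaces the TCP BTL would end up using for one process.
 * Assumption: the process picks if_seq[node_rank % seq_len]. A return
 * value of 0 means the TCP BTL is disabled for that process. */
size_t interfaces_used(const char *const *if_seq, size_t seq_len,
                       unsigned node_rank,
                       const char *const *present, size_t n_present,
                       enum missing_policy policy)
{
    assert(seq_len > 0);
    const char *chosen = if_seq[node_rank % seq_len];
    if (if_exists(chosen, present, n_present)) {
        return 1;  /* exactly the interface the user asked for */
    }
    return policy == POLICY_DISABLE_BTL ? 0 : n_present;
}
```

For example, with if_seq = {"eth0", "eth3"} on a node that only has eth0,
eth1 and eth2, node rank 1 selects the missing eth3. Under
POLICY_DISABLE_BTL that process uses no TCP interface at all; under
POLICY_IGNORE_IF_SEQ it silently uses all three, which is the surprise I am
worried about.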

  George.


On Fri, Feb 1, 2013 at 6:50 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> On Feb 1, 2013, at 6:28 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> So far, all interfaces specified via MCA parameters for the BTL TCP
>> are required to exist. Otherwise an error message is printed and an
>> error returned to the upper level, with the intent that no BTLs of
>> this type will be enabled (as an example btl_tcp_component.c:682).
>
> Actually, it doesn't -- that's why I made this one match the other behavior.
>
> For example, if I exclude an interface that doesn't exist (on v1.6 HEAD):
>
> -----
> [15:40] savbu-usnic:~/svn/ompi-1.6/examples % mpirun -np 2 --mca btl_tcp_if_exclude lo,bogus ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
> [15:40] savbu-usnic:~/svn/ompi-1.6/examples %
> -----
>
> Or if I include an interface that doesn't exist (although this one warns):
>
> -----
> [15:40] savbu-usnic:~/svn/ompi-1.6/examples % mpirun -np 2 --mca btl_tcp_if_include eth0,bogus ring_c
> [savbu-usnic][[7221,1],0][btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
>  invalid interface "bogus"
> [savbu-usnic][[7221,1],1][btl_tcp_component.c:682:mca_btl_tcp_component_create_instances]
>  invalid interface "bogus"
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
> [15:42] savbu-usnic:~/svn/ompi-1.6/examples %
> -----
>
> Are there other cases that I'm missing where we *do* abort?
>
> If so, we should probably be consistent: pick one way (abort or not abort) 
> and do that in all cases.  I don't think I have much of an opinion here on 
> which way we should go; I can see multiple arguments:
>
> - We should abort: we have a large precedent in many other places in OMPI that 
> if a human asks for something OMPI can't deliver, we abort and make the human 
> figure it out.
>
> - We should warn/not abort: this is the behavior we've had for a long time.  
> Changing it may break backwards compatibility.
>
>
>
>> If I correctly understand your commit, it changes this [so far
>> consistent] behavior for a single one of our TCP MCA parameters (if_seq)
>> to: print an error message and then continue. As you set
>> mca_btl_tcp_component.tcp_if_seq to NULL, this is as if this
>> argument was never provided.
>>
>> I prefer the old behavior for its corrective meaning (you fix it and
>> then it works), as well as for its consistency with the other BTL TCP
>> parameters.
>>
>>  George.
>>
>>
>>
>> On Fri, Feb 1, 2013 at 3:17 PM,  <svn-commit-mai...@open-mpi.org> wrote:
>>> Author: jsquyres (Jeff Squyres)
>>> Date: 2013-02-01 15:17:43 EST (Fri, 01 Feb 2013)
>>> New Revision: 28016
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/28016
>>>
>>> Log:
>>> As the help message states, it's not an ''error'' if the specified
>>> interface is not found.  It should just be skipped.
>>>
>>> Text files modified:
>>>   trunk/ompi/mca/btl/tcp/btl_tcp_component.c |     8 +++++---
>>>   1 files changed, 5 insertions(+), 3 deletions(-)
>>>
>>> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_component.c
>>> ==============================================================================
>>> --- trunk/ompi/mca/btl/tcp/btl_tcp_component.c  Fri Feb  1 09:27:37 2013        (r28015)
>>> +++ trunk/ompi/mca/btl/tcp/btl_tcp_component.c  2013-02-01 15:17:43 EST (Fri, 01 Feb 2013)      (r28016)
>>> @@ -314,10 +314,12 @@
>>>                                ompi_process_info.nodename,
>>>                                mca_btl_tcp_component.tcp_if_seq,
>>>                                "Interface does not exist");
>>> -                return OMPI_ERR_BAD_PARAM;
>>> +                free(mca_btl_tcp_component.tcp_if_seq);
>>> +                mca_btl_tcp_component.tcp_if_seq = NULL;
>>> +            } else {
>>> +                BTL_VERBOSE(("Node rank %d using TCP interface %s",
>>> +                             node_rank, mca_btl_tcp_component.tcp_if_seq));
>>>             }
>>> -            BTL_VERBOSE(("Node rank %d using TCP interface %s",
>>> -                         node_rank, mca_btl_tcp_component.tcp_if_seq));
>>>         }
>>>     }
>>>
>>> _______________________________________________
>>> svn mailing list
>>> s...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
