Hello Ralph and Gilles

Thanks for the clarification. My understanding was that if a BTL was
specified to mpirun, then only BTLs (and, therefore, the ob1 PML) would be
used. However, I kept seeing that this was not the case, and now I know why.

I do have PSM-capable cards (QLogic IB) in my nodes, and this time the
link was up (however, as I reported earlier, this behaviour happens even
with the PSM link down), so obviously the PSM MTL was chosen.
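
As a rough illustration of the selection seen in the verbose log quoted
further down, here is a toy Python model of priority-based PML selection.
It is not Open MPI code: the Component class and select_pml() function are
hypothetical names made up for this sketch. The priorities (cm = 30,
ob1 = 20) are taken from that log; the idea that cm declines when no MTL is
usable is an assumption consistent with Gilles's explanation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Component:
    """Hypothetical stand-in for a PML component and the priority its init returned."""
    name: str
    priority: Optional[int]  # None models a component that declined to run


def select_pml(components: Iterable[Component],
               include: Optional[Set[str]] = None) -> Component:
    """Pick the highest-priority component that did not decline.

    `include` models an explicit request such as `--mca pml ob1`:
    components outside the set are never considered.
    """
    candidates = [
        c for c in components
        if c.priority is not None and (include is None or c.name in include)
    ]
    if not candidates:
        raise RuntimeError("no PML component available")
    return max(candidates, key=lambda c: c.priority)


# A usable MTL (PSM here) lets cm offer priority 30, which beats ob1's 20,
# regardless of which BTLs were requested.
with_psm = [Component("cm", 30), Component("ob1", 20)]

# Assumption: with no usable MTL, cm declines and ob1 is the only candidate.
without_mtl = [Component("cm", None), Component("ob1", 20)]

print(select_pml(with_psm).name)                   # default selection
print(select_pml(with_psm, include={"ob1"}).name)  # explicit PML request
print(select_pml(without_mtl).name)                # cm declined
```

In this model the BTL list never enters the decision, which matches what I
observed: asking for "-mca btl self,tcp" restricts the BTLs but does not by
itself force the ob1 PML, whereas "--mca pml ob1" does.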


Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Thu, Apr 28, 2016 at 11:41 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> my basic understanding is that ob1 works with btl, and cm works with mtl
> (someone please correct me if I am wrong)
> another way to put this is cm cannot use the tcp btl.
>
> so I can only guess one mtl (PSM ?) is available, and so cm is preferred
> over ob1.
>
> what if you
> mpirun --mca mtl ^psm ...
> is cm selected over ob1 ?
>
> note PSM does not disqualify itself if there is no link, and this is
> now being investigated at intel.
>
> Cheers,
>
> Gilles
>
> On Friday, April 29, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>
>> Hello Gilles
>>
>> You are absolutely right:
>>
>> 1. Adding --mca pml_base_verbose 100 does show that it is the cm PML that
>> is being picked by default (even for TCP)
>> 2. Adding --mca pml ob1 does cause add_procs() and related BTL friends to
>> be invoked.
>>
>>
>> With a command line of
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp  -mca
>> btl_base_verbose 100 -mca pml_base_verbose 100 ./mpitest
>>
>> The output shows (among many other lines) the following:
>>
>> [smallMPI:49178] select: init returned priority 30
>> [smallMPI:49178] select: initializing pml component ob1
>> [smallMPI:49178] select: init returned priority 20
>> [smallMPI:49178] select: component v not in the include list
>> [smallMPI:49178] selected cm best priority 30
>>
>> [smallMPI:49178] select: component ob1 not selected / finalized
>> [smallMPI:49178] select: component cm selected
>>
>> Which shows that the cm PML was selected. Replacing 'tcp' above with
>> 'openib' shows very similar results. (The openib BTL methods are not
>> invoked, either)
>>
>> However, I was under the impression that the CM PML can only handle MTLs
>> (and ob1 can only handle BTLs). So why is cm being selected for TCP?
>>
>> Thank you
>> Durga
>>
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Thu, Apr 28, 2016 at 2:34 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> the add_procs subroutine of the btl should be called.
>>>
>>> /* i added a printf in mca_btl_tcp_add_procs and it *is* invoked */
>>>
>>> can you try again with --mca pml ob1 --mca pml_base_verbose 100 ?
>>>
>>> maybe the add_procs subroutine is not invoked because openmpi uses cm
>>> instead of ob1
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 4/28/2016 3:07 PM, dpchoudh . wrote:
>>>
>>> Hello all
>>>
>>> I have been struggling with this issue for the last few days and thought
>>> it would be prudent to ask for help from people who have far more
>>> experience than I do.
>>>
>>> I have two questions that are interrelated in my mind but may not be so
>>> in reality. Question 2 is the issue I am struggling with, and question 1
>>> sort of leads to it.
>>>
>>> 1. I see that in both the openib and tcp BTLs (the two kinds of hardware I
>>> have access to) a modex send happens, but a matching modex receive never
>>> happens. Is it because of some kind of optimization? (In my case, both IP
>>> NICs are in the same IP subnet and both IB NICs are in the same IB subnet)
>>> Or am I not understanding something? How do the processes figure out their
>>> peer information without a modex receive?
>>>
>>> The place in code where the modex receive is called is in
>>> btl_add_procs(). However, it looks like in both the above BTLs, this method
>>> is never called. Is that expected?
>>>
>>> 2. The real question is this:
>>> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code)
>>> that has no routing capability in protocol, and hence no concept of
>>> subnets. An HCA simply needs to be plugged in to the switch and it can see
>>> the whole network. However, there is a VLAN-like partitioning scheme
>>> (similar to IB partitions).
>>> Given this (and as a first cut, every node is in the same partition, so
>>> even this complexity is eliminated), there is not much use for a modex
>>> exchange, but I added one anyway just with the partition key.
>>>
>>> What I see is that the component open, register and init are all
>>> successful, but r2 bml still does not choose this network and thus OMPI
>>> aborts because of lack of full reachability.
>>>
>>> This is my command line:
>>> sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np
>>> 2 -mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100
>>> ./mpitest
>>>
>>> ('mpitest' is a trivial 'hello world' program plus ONE
>>> MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required
>>> because currently the driver requires root permission; I was told that this
>>> will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with
>>> back-to-back connection on this 'lf' HCA)
>>>
>>> The output of this command is as follows; I have added my comments to
>>> explain it a bit.
>>>
>>> <Output from OMPI logging mechanism>
>>> [b-2:21062] mca: base: components_register: registering framework bml
>>> components
>>> [b-2:21062] mca: base: components_register: found loaded component r2
>>> [b-2:21062] mca: base: components_register: component r2 register
>>> function successful
>>> [b-2:21062] mca: base: components_open: opening bml components
>>> [b-2:21062] mca: base: components_open: found loaded component r2
>>> [b-2:21062] mca: base: components_open: component r2 open function
>>> successful
>>> [b-2:21062] mca: base: components_register: registering framework btl
>>> components
>>> [b-2:21062] mca: base: components_register: found loaded component self
>>> [b-2:21062] mca: base: components_register: component self register
>>> function successful
>>> [b-2:21062] mca: base: components_register: found loaded component lf
>>> [b-2:21062] mca: base: components_register: component lf register
>>> function successful
>>> [b-2:21062] mca: base: components_open: opening btl components
>>> [b-2:21062] mca: base: components_open: found loaded component self
>>> [b-2:21062] mca: base: components_open: component self open function
>>> successful
>>> [b-2:21062] mca: base: components_open: found loaded component lf
>>>
>>> <Debugging output from the HCA driver>
>>> lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
>>>
>>> <Output from OMPI logging mechanism, continued>
>>> [b-2:21062] mca: base: components_open: component lf open function
>>> successful
>>> [b-2:21062] select: initializing btl component self
>>> [b-2:21062] select: init of component self returned success
>>> [b-2:21062] select: initializing btl component lf
>>>
>>> <Debugging output from the HCA driver>
>>> Created group on b-2
>>>
>>> <Output from OMPI logging mechanism, continued>
>>> [b-2:21062] select: init of component lf returned success
>>> [b-3:07672] mca: base: components_register: registering framework bml
>>> components
>>> [b-3:07672] mca: base: components_register: found loaded component r2
>>> [b-3:07672] mca: base: components_register: component r2 register
>>> function successful
>>> [b-3:07672] mca: base: components_open: opening bml components
>>> [b-3:07672] mca: base: components_open: found loaded component r2
>>> [b-3:07672] mca: base: components_open: component r2 open function
>>> successful
>>> [b-3:07672] mca: base: components_register: registering framework btl
>>> components
>>> [b-3:07672] mca: base: components_register: found loaded component self
>>> [b-3:07672] mca: base: components_register: component self register
>>> function successful
>>> [b-3:07672] mca: base: components_register: found loaded component lf
>>> [b-3:07672] mca: base: components_register: component lf register
>>> function successful
>>> [b-3:07672] mca: base: components_open: opening btl components
>>> [b-3:07672] mca: base: components_open: found loaded component self
>>> [b-3:07672] mca: base: components_open: component self open function
>>> successful
>>> [b-3:07672] mca: base: components_open: found loaded component lf
>>> [b-3:07672] mca: base: components_open: component lf open function
>>> successful
>>> [b-3:07672] select: initializing btl component self
>>> [b-3:07672] select: init of component self returned success
>>> [b-3:07672] select: initializing btl component lf
>>>
>>> <Debugging output from the HCA driver>
>>> lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
>>> Created group on b-3
>>>
>>> <Output from OMPI logging mechanism, continued>
>>> [b-3:07672] select: init of component lf returned success
>>> [b-2:21062] mca: bml: Using self btl for send to [[6866,1],0] on node b-2
>>> [b-3:07672] mca: bml: Using self btl for send to [[6866,1],1] on node b-3
>>>
>>> <Output from the 'mpitest' MPI program: out-of-band-I/O>
>>> Hello from b-2
>>> The world has 2 nodes
>>> My rank is 0
>>> Hello from b-3
>>>
>>> <Output from OMPI>
>>>
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications.  This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes.  This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other.  This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>>   Process 1 ([[6866,1],0]) is on host: b-2
>>>   Process 2 ([[6866,1],1]) is on host: 10.4.70.12
>>>   BTLs attempted: self
>>>
>>> Your MPI job is now going to abort; sorry.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> <Output from the 'mpitest' MPI program: out-of-band-I/O, continued>
>>> The world has 2 nodes
>>> My rank is 1
>>>
>>> <Output from OMPI logging mechanism, continued>
>>> [b-2:21062] *** An error occurred in MPI_Send
>>> [b-2:21062] *** reported by process [140385751007233,21474836480]
>>> [b-2:21062] *** on communicator MPI_COMM_WORLD
>>> [b-2:21062] *** MPI_ERR_INTERN: internal error
>>> [b-2:21062] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>> will now abort,
>>> [b-2:21062] ***    and potentially your MPI job)
>>> [durga@b-2 ~]$
>>>
>>> As you can see, the lf network is not being chosen for communication.
>>> Without a modex exchange, how can that happen? Or, in a nutshell, what do I
>>> need to do?
>>>
>>> Thanks a lot in advance
>>> Durga
>>>
>>>
>>> 1% of the executables have 99% of CPU privilege!
>>> Userspace code! Unite!! Occupy the kernel!!!
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2016/04/18827.php
>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2016/04/18828.php
>>>
>>
>>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18837.php
>
