In Open MPI a process only retrieves information about a peer when they
communicate. Thus, the add_proc is called on both sides of a connection
establishment: when a connection is decided locally, or when an incoming
network packet requires the existence of a proc (for the initiator of the
connection). The modex_recv is therefore called deep inside the BTL, usually
when the proc is created (TCP has a specific function for this,
mca_btl_tcp_proc_create).
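For illustration, here is a minimal sketch of how that typically looks inside
a BTL, loosely modeled on mca_btl_tcp_proc_create. The lf names and struct
fields are hypothetical, and the exact OPAL_MODEX_RECV macro arguments may
differ between Open MPI versions:

    /* Sketch: pull the peer's modex blob while creating the proc.
     * mca_btl_lf_proc_t, proc_addrs and mca_btl_lf_addr_t are placeholders. */
    mca_btl_lf_proc_t *mca_btl_lf_proc_create(opal_proc_t *proc)
    {
        int rc;
        size_t size = 0;
        mca_btl_lf_proc_t *lf_proc = OBJ_NEW(mca_btl_lf_proc_t);

        /* Receive whatever the peer published with OPAL_MODEX_SEND at
         * component init time (in your case, the partition key). */
        OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                        &proc->proc_name, (void **)&lf_proc->proc_addrs, &size);
        if (OPAL_SUCCESS != rc) {
            OBJ_RELEASE(lf_proc);
            return NULL;
        }
        lf_proc->proc_addr_count = size / sizeof(mca_btl_lf_addr_t);
        return lf_proc;
    }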

More inline.

On Thu, Apr 28, 2016 at 2:07 AM, dpchoudh . <dpcho...@gmail.com> wrote:

> Hello all
>
> I have been struggling with this issue for the last few days and thought it would be
> prudent to ask for help from people who have way more experience than I do.
>
> There are two questions, interrelated in my mind, but may not be so in
> reality. Question 2 is the issue I am struggling with, and question 1 sort
> of leads to it.
>
> 1. I see that in both the openib and tcp BTLs (the two kinds of hardware I have
> access to) a modex send happens, but a matching modex receive never
> happens. Is it because of some kind of optimization? (In my case, both IP
> NICs are in the same IP subnet and both IB NICs are in the same IB subnet)
> Or am I not understanding something? How do the processes figure out their
> peer information without a modex receive?
>
> The place in code where the modex receive is called is in btl_add_procs().
> However, it looks like in both the above BTLs, this method is never called.
> Is that expected?
>

It is called upon communication. Force an MPI_Barrier in your example, and
you will see add_proc called for a subset of the peers (which subset depends
on the barrier algorithm).
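Something as small as the following (plain MPI, nothing Open MPI specific) is
enough to trigger that path; the collective forces point-to-point traffic,
which in turn exercises add_proc for the peers involved:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* The barrier makes every process talk to at least one peer,
         * so the BML/BTL add_proc path gets exercised. */
        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank %d of %d passed the barrier\n", rank, size);
        MPI_Finalize();
        return 0;
    }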


>
> 2. The real question is this:
> I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code)
> that has no routing capability in its protocol, and hence no concept of
> subnets. An HCA simply needs to be plugged into the switch and it can see
> the whole network. However, there is a VLAN-like partitioning scheme
> (similar to IB partitions).
> Given this (and as a first cut, every node is in the same partition, so
> even that complexity is eliminated), there is not much use for a modex
> exchange, but I added one anyway, with just the partition key.
>
> What I see is that the component open, register and init are all
> successful, but r2 bml still does not choose this network and thus OMPI
> aborts because of lack of full reachability.
>

The BML should have selected your network if your _open returned
OPAL_SUCCESS and your _init returned a module (or a list of them).
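As a reminder, the init hook has the shape below; returning NULL (or a zero
count) is what makes the BML ignore the component. The lf symbols are
placeholders for whatever your component actually defines:

    /* Sketch of a BTL component init; mca_btl_lf_module is hypothetical. */
    static mca_btl_base_module_t **
    mca_btl_lf_component_init(int *num_btl_modules,
                              bool enable_progress_threads,
                              bool enable_mpi_threads)
    {
        mca_btl_base_module_t **btls;

        btls = (mca_btl_base_module_t **)malloc(sizeof(mca_btl_base_module_t *));
        if (NULL == btls) {
            return NULL;  /* no modules -> the BML will not consider this BTL */
        }
        btls[0] = &mca_btl_lf_module.super;
        *num_btl_modules = 1;
        return btls;
    }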


>
> This is my command line:
> sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2
> -mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100
> ./mpitest
>
> ('mpitest' is a trivial 'hello world' program plus ONE
> MPI_Send()/MPI_Recv() to test in-band communication. The sudo is required
> because currently the driver requires root permission; I was told that this
> will be fixed. The hostfile has 2 hosts, named b-2 and b-3, with
> back-to-back connection on this 'lf' HCA)
>
> The output of this command is as follows; I have added my comments to
> explain it a bit.
>
> <Output from OMPI logging mechanism>
> [b-2:21062] mca: base: components_register: registering framework bml
> components
> [b-2:21062] mca: base: components_register: found loaded component r2
> [b-2:21062] mca: base: components_register: component r2 register function
> successful
> [b-2:21062] mca: base: components_open: opening bml components
> [b-2:21062] mca: base: components_open: found loaded component r2
> [b-2:21062] mca: base: components_open: component r2 open function
> successful
> [b-2:21062] mca: base: components_register: registering framework btl
> components
> [b-2:21062] mca: base: components_register: found loaded component self
> [b-2:21062] mca: base: components_register: component self register
> function successful
> [b-2:21062] mca: base: components_register: found loaded component lf
> [b-2:21062] mca: base: components_register: component lf register function
> successful
> [b-2:21062] mca: base: components_open: opening btl components
> [b-2:21062] mca: base: components_open: found loaded component self
> [b-2:21062] mca: base: components_open: component self open function
> successful
> [b-2:21062] mca: base: components_open: found loaded component lf
>

Your _open returned OPAL_SUCCESS, so the BTL is on.


>
> <Debugging output from the HCA driver>
> lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
>
> <Output from OMPI logging mechanism, continued>
> [b-2:21062] mca: base: components_open: component lf open function
> successful
> [b-2:21062] select: initializing btl component self
> [b-2:21062] select: init of component self returned success
> [b-2:21062] select: initializing btl component lf
>
> <Debugging output from the HCA driver>
> Created group on b-2
>
> <Output from OMPI logging mechanism, continued>
> [b-2:21062] select: init of component lf returned success
> [b-3:07672] mca: base: components_register: registering framework bml
> components
> [b-3:07672] mca: base: components_register: found loaded component r2
> [b-3:07672] mca: base: components_register: component r2 register function
> successful
> [b-3:07672] mca: base: components_open: opening bml components
> [b-3:07672] mca: base: components_open: found loaded component r2
> [b-3:07672] mca: base: components_open: component r2 open function
> successful
> [b-3:07672] mca: base: components_register: registering framework btl
> components
> [b-3:07672] mca: base: components_register: found loaded component self
> [b-3:07672] mca: base: components_register: component self register
> function successful
> [b-3:07672] mca: base: components_register: found loaded component lf
> [b-3:07672] mca: base: components_register: component lf register function
> successful
> [b-3:07672] mca: base: components_open: opening btl components
> [b-3:07672] mca: base: components_open: found loaded component self
> [b-3:07672] mca: base: components_open: component self open function
> successful
> [b-3:07672] mca: base: components_open: found loaded component lf
> [b-3:07672] mca: base: components_open: component lf open function
> successful
> [b-3:07672] select: initializing btl component self
> [b-3:07672] select: init of component self returned success
> [b-3:07672] select: initializing btl component lf
>
> <Debugging output from the HCA driver>
> lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
> Created group on b-3
>
> <Output from OMPI logging mechanism, continued>
> [b-3:07672] select: init of component lf returned success
>

Based on this output I assume that your BTL's _init has returned something
other than NULL.


> [b-2:21062] mca: bml: Using self btl for send to [[6866,1],0] on node b-2
> [b-3:07672] mca: bml: Using self btl for send to [[6866,1],1] on node b-3
>

This is definitely not good (self should never be selected for
communication with another process), but I guess it is a side effect of the
fact that we lost our connectivity map (thus any BTL is assumed to be able
to communicate with anybody else).

The problem might be coming from here. Is your BTL setting the exclusivity
flag? What is your BTL's priority? I would look in more detail at the
mca_bml_r2_endpoint_add_btl function.
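If it helps, those knobs usually get set in the component register, along the
lines of the sketch below (the lf names are placeholders and the default
values are up to you). mca_btl_base_param_register is the base helper that
exposes btl_exclusivity and the other module fields as MCA parameters:

    static int mca_btl_lf_component_register(void)
    {
        /* self uses MCA_BTL_EXCLUSIVITY_HIGH; a regular network BTL normally
         * stays at or below MCA_BTL_EXCLUSIVITY_DEFAULT so it does not shadow
         * (or get shadowed by) the wrong BTLs during r2's selection. */
        mca_btl_lf_module.super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_DEFAULT;
        mca_btl_lf_module.super.btl_latency   = 10;    /* relative ordering hint */
        mca_btl_lf_module.super.btl_bandwidth = 10000; /* Mbit/s, same purpose */

        /* Registers btl_lf_exclusivity, btl_lf_flags, etc. as MCA parameters. */
        return mca_btl_base_param_register(&mca_btl_lf_component.super.btl_version,
                                           &mca_btl_lf_module.super);
    }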

  George.




>
>
> <Output from the 'mpitest' MPI program: out-of-band-I/O>
> Hello from b-2
> The world has 2 nodes
> My rank is 0
> Hello from b-3
>
> <Output from OMPI>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[6866,1],0]) is on host: b-2
>   Process 2 ([[6866,1],1]) is on host: 10.4.70.12
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> <Output from the 'mpitest' MPI program: out-of-band-I/O, continued>
> The world has 2 nodes
> My rank is 1
>
> <Output from OMPI logging mechanism, continued>
> [b-2:21062] *** An error occurred in MPI_Send
> [b-2:21062] *** reported by process [140385751007233,21474836480]
> [b-2:21062] *** on communicator MPI_COMM_WORLD
> [b-2:21062] *** MPI_ERR_INTERN: internal error
> [b-2:21062] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
> now abort,
> [b-2:21062] ***    and potentially your MPI job)
> [durga@b-2 ~]$
>
> As you can see, the lf network is not being chosen for communication.
> Without a modex exchange, how can that happen? Or, in a nutshell, what do I
> need to do?
>
> Thanks a lot in advance
> Durga
>
>
> 1% of the executables have 99% of CPU privilege!
> Userspace code! Unite!! Occupy the kernel!!!
>
>
