Hello all

I have been struggling with this issue for the last few days and thought it
would be prudent to ask for help from people who have far more experience
than I do.

There are two questions that are interrelated in my mind, though they may
not be in reality. Question 2 is the issue I am struggling with; question 1
sort of leads to it.

1. I see that in both the openib and tcp BTLs (the two kinds of hardware I
have access to) a modex send happens, but a matching modex receive never
does. Is this because of some kind of optimization? (In my case, both IP
NICs are on the same IP subnet and both IB NICs are on the same IB subnet.)
Or am I misunderstanding something? How do the processes figure out their
peer information without a modex receive?

The place in the code where the modex receive is called is btl_add_procs().
However, it looks like this method is never called in either of the above
BTLs. Is that expected?

2. The real question is this:
I am writing a BTL for a proprietary RDMA NIC (named 'lf' in the code) that
has no routing capability in its protocol, and hence no concept of subnets.
An HCA simply needs to be plugged into the switch and it can see the whole
network. There is, however, a VLAN-like partitioning scheme (similar to IB
partitions). Given this (and, as a first cut, every node is in the same
partition, so even that complexity is eliminated), there is not much use for
a modex exchange, but I added one anyway, carrying just the partition key
(sketched below).
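
This is roughly what the send side looks like in my component init path. The
mca_btl_lf_* names and the modex struct are my own, and the OPAL_MODEX_SEND
usage (including the scope constant) is based on my reading of master, so it
may be spelled differently in other releases:

/* Sketch of the modex send done at component init time.
 * mca_btl_lf_modex_t and all lf names are my own; relevant
 * OMPI headers are omitted for brevity. */
typedef struct {
    uint16_t partition_key;    /* everyone is in partition 0 for now */
} mca_btl_lf_modex_t;

static int mca_btl_lf_send_modex(void)
{
    mca_btl_lf_modex_t modex = { .partition_key = 0 };
    int rc;

    /* Publish the partition key so peers can pick it up later. */
    OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                    &mca_btl_lf_component.super.btl_version,
                    &modex, sizeof(modex));
    return rc;
}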

What I see is that the component register, open, and init all succeed, but
the r2 BML still does not choose this network, so OMPI aborts because of the
lack of full reachability.
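
My understanding is that r2 decides reachability from the bitmap that each
BTL fills in from its add_procs, so this is the shape I have been aiming for
in the lf add_procs. Again a sketch with my own lf names; the
OPAL_MODEX_RECV usage is from my reading of master and the OMPI headers are
omitted:

/* Sketch of the intended lf add_procs: pull each peer's modex blob
 * and mark the peer reachable so that r2 will select this BTL. */
static int mca_btl_lf_add_procs(struct mca_btl_base_module_t *btl,
                                size_t nprocs,
                                struct opal_proc_t **procs,
                                struct mca_btl_base_endpoint_t **peers,
                                opal_bitmap_t *reachable)
{
    for (size_t i = 0; i < nprocs; ++i) {
        mca_btl_lf_modex_t *remote;
        size_t size;
        int rc;

        /* Receive the partition key the peer published at init time. */
        OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                        &procs[i]->proc_name, (uint8_t **)&remote, &size);
        if (OPAL_SUCCESS != rc) {
            continue;    /* peer did not publish lf info */
        }

        /* Create the endpoint and set the reachability bit; without
         * this bit, r2 never considers the BTL for this peer. */
        peers[i] = OBJ_NEW(mca_btl_lf_endpoint_t);
        opal_bitmap_set_bit(reachable, i);
    }
    return OPAL_SUCCESS;
}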

This is my command line:
sudo /usr/local/bin/mpirun --allow-run-as-root -hostfile ~/hostfile -np 2
-mca btl self,lf -mca btl_base_verbose 100 -mca bml_base_verbose 100
./mpitest

('mpitest' is a trivial 'hello world' program plus ONE MPI_Send()/MPI_Recv()
to test in-band communication. The sudo is required because the driver
currently needs root permission; I was told this will be fixed. The hostfile
has two hosts, named b-2 and b-3, connected back-to-back over this 'lf'
HCA.)
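
For reference, mpitest is essentially the following (reconstructed here from
memory, so the details may be slightly off):

/* Roughly what mpitest does: hello world plus a single
 * MPI_Send()/MPI_Recv() between ranks 0 and 1 to exercise
 * in-band communication. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    char host[64];

    MPI_Init(&argc, &argv);
    gethostname(host, sizeof(host));
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from %s\n", host);
    printf("The world has %d nodes\n", size);
    printf("My rank is %d\n", rank);

    if (0 == rank) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}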

The output of this command is as follows; I have added my comments to
explain it a bit.

<Output from OMPI logging mechanism>
[b-2:21062] mca: base: components_register: registering framework bml
components
[b-2:21062] mca: base: components_register: found loaded component r2
[b-2:21062] mca: base: components_register: component r2 register function
successful
[b-2:21062] mca: base: components_open: opening bml components
[b-2:21062] mca: base: components_open: found loaded component r2
[b-2:21062] mca: base: components_open: component r2 open function
successful
[b-2:21062] mca: base: components_register: registering framework btl
components
[b-2:21062] mca: base: components_register: found loaded component self
[b-2:21062] mca: base: components_register: component self register
function successful
[b-2:21062] mca: base: components_register: found loaded component lf
[b-2:21062] mca: base: components_register: component lf register function
successful
[b-2:21062] mca: base: components_open: opening btl components
[b-2:21062] mca: base: components_open: found loaded component self
[b-2:21062] mca: base: components_open: component self open function
successful
[b-2:21062] mca: base: components_open: found loaded component lf

<Debugging output from the HCA driver>
lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)

<Output from OMPI logging mechanism, continued>
[b-2:21062] mca: base: components_open: component lf open function
successful
[b-2:21062] select: initializing btl component self
[b-2:21062] select: init of component self returned success
[b-2:21062] select: initializing btl component lf

<Debugging output from the HCA driver>
Created group on b-2

<Output from OMPI logging mechanism, continued>
[b-2:21062] select: init of component lf returned success
[b-3:07672] mca: base: components_register: registering framework bml
components
[b-3:07672] mca: base: components_register: found loaded component r2
[b-3:07672] mca: base: components_register: component r2 register function
successful
[b-3:07672] mca: base: components_open: opening bml components
[b-3:07672] mca: base: components_open: found loaded component r2
[b-3:07672] mca: base: components_open: component r2 open function
successful
[b-3:07672] mca: base: components_register: registering framework btl
components
[b-3:07672] mca: base: components_register: found loaded component self
[b-3:07672] mca: base: components_register: component self register
function successful
[b-3:07672] mca: base: components_register: found loaded component lf
[b-3:07672] mca: base: components_register: component lf register function
successful
[b-3:07672] mca: base: components_open: opening btl components
[b-3:07672] mca: base: components_open: found loaded component self
[b-3:07672] mca: base: components_open: component self open function
successful
[b-3:07672] mca: base: components_open: found loaded component lf
[b-3:07672] mca: base: components_open: component lf open function
successful
[b-3:07672] select: initializing btl component self
[b-3:07672] select: init of component self returned success
[b-3:07672] select: initializing btl component lf

<Debugging output from the HCA driver>
lf_group_lib.c:442: _lf_open: _lf_open("MPI_0",0x842,0x1b6,4096,0)
Created group on b-3

<Output from OMPI logging mechanism, continued>
[b-3:07672] select: init of component lf returned success
[b-2:21062] mca: bml: Using self btl for send to [[6866,1],0] on node b-2
[b-3:07672] mca: bml: Using self btl for send to [[6866,1],1] on node b-3

<Output from the 'mpitest' MPI program: out-of-band-I/O>
Hello from b-2
The world has 2 nodes
My rank is 0
Hello from b-3

<Output from OMPI>
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[6866,1],0]) is on host: b-2
  Process 2 ([[6866,1],1]) is on host: 10.4.70.12
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

<Output from the 'mpitest' MPI program: out-of-band-I/O, continued>
The world has 2 nodes
My rank is 1

<Output from OMPI logging mechanism, continued>
[b-2:21062] *** An error occurred in MPI_Send
[b-2:21062] *** reported by process [140385751007233,21474836480]
[b-2:21062] *** on communicator MPI_COMM_WORLD
[b-2:21062] *** MPI_ERR_INTERN: internal error
[b-2:21062] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
[b-2:21062] ***    and potentially your MPI job)
[durga@b-2 ~]$

As you can see, the lf network is not being chosen for communication.
Without a modex exchange happening, how could it ever be chosen? Or, in a
nutshell: what do I need to do?

Thanks a lot in advance
Durga


1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!
