From time to time I have a need to run Open MPI apps using the openib 
btl on a single node, where port 1 on the HCA is connected to port 2 on the 
same HCA.

Using a vintage Open MPI 1.5.4, my command line would read:

mpiexec --mca btl self,openib --mca btl_openib_cpc_include oob \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:1 ./a.out  : \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:2 ./a.out
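
For reference, ./a.out can be any two-rank MPI program here; since the self 
btl only handles a rank talking to itself, all traffic between the two ranks 
is forced across the openib btl and hence across the two ports. A minimal 
sketch would be something like:

   /* Minimal two-rank exchange; any MPI program that moves data
    * between the ranks (and hence over the openib btl) would do. */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, peer, token = 0;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       peer = 1 - rank;            /* exactly two ranks assumed */

       if (rank == 0) {
           token = 42;
           MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
           MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
       } else {
           MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
           MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
       }
       printf("rank %d done, token = %d\n", rank, token);
       MPI_Finalize();
       return 0;
   }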


Recently I had a need for a newer Open MPI, and compiled and installed version 
1.8.2. That is when the problems began ;-) Apparently, the old (and in my 
opinion nice) "oob" connection management method has disappeared. However, 
after modifying the command line to:

mpiexec --mca btl self,openib --mca btl_openib_cpc_include udcm \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:1 ./a.out : \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:2 ./a.out


I get tons of messages like:

connect/btl_openib_connect_udcm.c:1390:udcm_find_endpoint] could not find 
endpoint with port: 1, lid: 4608, msg_type: 100

Interestingly, the lid here is the lid for port 2 (when port numbers start at 
1). I suspect that the printout above counts ports from zero.
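
To double-check which lid belongs to which verbs port number, I use a small 
libibverbs program along these lines (ibv_devinfo gives the same information; 
the sketch assumes the device is mlx4_0 and links with -libverbs):

   /* Print the lid reported for each port of mlx4_0.
    * Note that verbs port numbers start at 1. */
   #include <stdio.h>
   #include <string.h>
   #include <stdint.h>
   #include <infiniband/verbs.h>

   int main(void)
   {
       int num;
       struct ibv_device **devs = ibv_get_device_list(&num);
       if (!devs) { perror("ibv_get_device_list"); return 1; }

       for (int i = 0; i < num; ++i) {
           if (strcmp(ibv_get_device_name(devs[i]), "mlx4_0") != 0)
               continue;
           struct ibv_context *ctx = ibv_open_device(devs[i]);
           if (!ctx) { perror("ibv_open_device"); break; }

           struct ibv_device_attr dev_attr;
           ibv_query_device(ctx, &dev_attr);
           for (uint8_t p = 1; p <= dev_attr.phys_port_cnt; ++p) {
               struct ibv_port_attr port_attr;
               if (ibv_query_port(ctx, p, &port_attr) == 0)
                   printf("mlx4_0 port %u: lid %u\n", p, port_attr.lid);
           }
           ibv_close_device(ctx);
       }
       ibv_free_device_list(devs);
       return 0;
   }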

Anyway, must I go back to an older Open MPI that supports "oob", or is there a 
flaw in my command line?


Thanks, Håkon
