Hi,

For rdmacm to work with the openib BTL, the first receive queue needs to be a point-to-point queue (not an SRQ, which is the default in OMPI v2.x). Can you please try adding this parameter to the command line?

  -mca btl_openib_receive_queues P,4096,8,6,4

You can change the numbers according to what you need.
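Combined with your full command line (quoted below), it would look something like this — the hostfile and binary names are simply the ones from your report:

  ~/openmpi/bin/mpirun -np 2 -hostfile hostfile \
      --mca btl openib,self,sm \
      --mca btl_openib_cpc_include rdmacm \
      --mca btl_openib_rroce_enable 1 \
      --mca btl_openib_receive_queues P,4096,8,6,4 \
      ./sendrecv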
Alina.

On Tue, Jun 13, 2017 at 7:57 PM, Chuanxiong Guo <chuanxiong....@gmail.com> wrote:
> here it is:
>
> ~/openmpi/bin/mpirun -np 2 -hostfile hostfile --mca btl openib,self,sm
> --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1
> ./sendrecv
>
> what I got is as follows:
>
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   chguo-msr-linux1
>   Local device: mlx5_0
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[45408,1],0]) is on host: chguo-msr-linux1
>   Process 2 ([[45408,1],1]) is on host: chguo-msr-linux02
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [chguo-msr-linux1:12690] *** An error occurred in MPI_Send
> [chguo-msr-linux1:12690] *** reported by process [140379686961153,140376711102464]
> [chguo-msr-linux1:12690] *** on communicator MPI_COMM_WORLD
> [chguo-msr-linux1:12690] *** MPI_ERR_INTERN: internal error
> [chguo-msr-linux1:12690] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [chguo-msr-linux1:12690] ***    and potentially your MPI job)
> [chguo-msr-linux1:12684] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
> [chguo-msr-linux1:12684] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
>
> On Tue, Jun 13, 2017 at 5:05 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>> Hi,
>>
>> Please include your full command line.
>>
>> Josh
>>
>> On Mon, Jun 12, 2017 at 6:17 PM, Chuanxiong Guo <chuanxiong....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have two servers with Mellanox CX4-LX (50GbE Ethernet) NICs connected
>>> back-to-back. I am using Ubuntu 14.04. I have made MVAPICH2 work, and I
>>> can confirm that both RoCE and RoCEv2 work well (verified by packet
>>> capture).
>>>
>>> But I still cannot make Open MPI work with RoCE. I am using Open MPI
>>> 2.1.1. It looks like this version of Open MPI does not recognize the
>>> CX4-LX, so I have added vendor part id 4117 to
>>> mca-btl-openib-device-params.ini, and I have also updated
>>> opal/mca/common/verbs/common_verbs_port.c to support the CX4-LX, which
>>> has speed 64 and width 1.
>>>
>>> But I am still getting:
>>>
>>> "WARNING: There was an error initializing an OpenFabrics device.
>>>   Local host:   chguo-msr-linux1
>>>   Local device: mlx5_0
>>> "
>>> Any hint on what is missing?
>>>
>>> Thanks,
>>> CX
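(On the device-recognition warning itself: the stanza added to mca-btl-openib-device-params.ini for the ConnectX-4 Lx would look roughly like the sketch below. Only the vendor_part_id 4117 comes from your report; the section name and the vendor_id/mtu/max_inline_data values here are assumptions, so copy them from the existing ConnectX-4 entry in your installation rather than from this sketch.)

  [Mellanox ConnectX4 Lx]
  # Hypothetical entry: check the neighbouring ConnectX-4 section for the
  # correct vendor_id list and tuning values before reusing these numbers.
  vendor_id = 0x2c9,0x02c9,0x15b3
  vendor_part_id = 4117
  use_eager_rdma = 1
  mtu = 4096
  max_inline_data = 256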
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel