Hi,

For rdmacm to work with the openib BTL, the first receive queue needs to be a
per-peer (point-to-point) queue, not an SRQ, which is the default in OMPI v2.x.
Can you please try adding this parameter to the command line?
-mca btl_openib_receive_queues P,4096,8,6,4
You can adjust the numbers to suit your needs.
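
For example, with your original command line (other flags unchanged, just the
extra parameter added), that would be something like:

  ~/openmpi/bin/mpirun -np 2 -hostfile hostfile \
      --mca btl openib,self,sm \
      --mca btl_openib_cpc_include rdmacm \
      --mca btl_openib_rroce_enable 1 \
      --mca btl_openib_receive_queues P,4096,8,6,4 \
      ./sendrecv

If I remember the format correctly, the values after "P" are the buffer size
in bytes, the number of buffers, the low buffer count watermark, and the
credit window size, so tune them for your message sizes as needed.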

Alina.

On Tue, Jun 13, 2017 at 7:57 PM, Chuanxiong Guo <chuanxiong....@gmail.com>
wrote:

> here it is:
> ~/openmpi/bin/mpirun -np 2 -hostfile hostfile --mca btl openib,self,sm
> --mca btl_openib_cpc_include rdmacm  --mca btl_openib_rroce_enable 1
> ./sendrecv
>
> This is what I got:
>
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>   Local host:   chguo-msr-linux1
>   Local device: mlx5_0
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[45408,1],0]) is on host: chguo-msr-linux1
>   Process 2 ([[45408,1],1]) is on host: chguo-msr-linux02
>   BTLs attempted: self
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [chguo-msr-linux1:12690] *** An error occurred in MPI_Send
> [chguo-msr-linux1:12690] *** reported by process [140379686961153,
> 140376711102464]
> [chguo-msr-linux1:12690] *** on communicator MPI_COMM_WORLD
> [chguo-msr-linux1:12690] *** MPI_ERR_INTERN: internal error
> [chguo-msr-linux1:12690] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [chguo-msr-linux1:12690] ***    and potentially your MPI job)
> [chguo-msr-linux1:12684] 1 more process has sent help message
> help-mpi-btl-openib.txt / error in device init
> [chguo-msr-linux1:12684] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
>
>
> On Tue, Jun 13, 2017 at 5:05 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>> Hi,
>>
>> Please include your full command line.
>>
>> Josh
>>
>> On Mon, Jun 12, 2017 at 6:17 PM, Chuanxiong Guo <chuanxiong....@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> I have two servers with Mellanox CX4-LX (50GbE Ethernet) NICs connected
>>> back-to-back. I am running Ubuntu 14.04. I have made MVAPICH2 work, and I
>>> can confirm that both RoCE and RoCEv2 work well (verified by packet capture).
>>>
>>> But I still cannot make Open MPI work with RoCE. I am using Open MPI 2.1.1.
>>> It looks like this version of Open MPI does not recognize the CX4-LX, so I
>>> have added vendor part id 4117 to mca-btl-openib-device-params.ini, and I
>>> have also updated opal/mca/common/verbs/common_verbs_port.c to support the
>>> CX4-LX, which has speed 64 and width 1.
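>>>
>>> For reference, the device-params entry I added is roughly of this shape
>>> (the section name and vendor_id value here are only illustrative, following
>>> the style of the existing Mellanox sections in that file):
>>>
>>>   # illustrative entry; the real file lists several vendor_id values
>>>   [Mellanox ConnectX4-LX]
>>>   vendor_id = 0x15b3
>>>   vendor_part_id = 4117
>>>   use_eager_rdma = 1
>>>   mtu = 4096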
>>>
>>> But I am still getting:
>>>
>>> "WARNING: There was an error initializing an OpenFabrics device.
>>>   Local host:   chguo-msr-linux1
>>>
>>>   Local device: mlx5_0
>>> "
>>> Any hint on what is missing?
>>>
>>> Thanks,
>>> CX
>>>
>>>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
