I checked:

1. This option existed in v2.0.1, but it no longer exists in the 
soon-to-be-released v2.0.2.
2. Here's where we removed it: https://github.com/open-mpi/ompi/pull/2350

There's no rationale listed on that PR, but the reason is that the code is 
stale and no longer works.
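
A quick sanity check against any tarball:

  ./configure --help | grep -i failover

If the flag is gone, configure will typically just print "configure: WARNING: 
unrecognized options: ..." at the end of its run rather than failing, so a 
stale --enable-btl-openib-failover in a build recipe (like the Gentoo ebuild 
below) is easy to miss.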

Sorry, Dave.  :-\


> On Jan 12, 2017, at 6:54 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> Did we just recently discuss the openib BTL failover capability and decide 
> that it had bit-rotted?
> 
> If so, we need to amend our documentation and disable the code.
> 
> 
>> On Jan 11, 2017, at 3:11 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>> 
>> 
>>     The btl_openib_receive_queues parameters that Howard provided
>> fixed our problem getting 2.0.1 working with RoCE, so thanks for
>> all the help.  However, we are seeing segfaults when 2.0.1 is
>> configured with --enable-btl-openib-failover.  I've included the 
>> configuration below that the package manager uses under Gentoo.
>> I also tested this after removing all of the redundant enable/disables,
>> and it's definitely the --enable-btl-openib-failover that causes 2.0.1
>> on RoCE to segfault.  I can enable debugging and recompile if more
>> information is needed.
>> 
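>> (For reference, the debug rebuild I'd do is roughly the following,
>> sketched assuming a plain tarball build rather than our Gentoo
>> ebuild, with ./mpi_test standing in for whatever reproducer we run:
>> 
>>   ./configure --enable-debug <same flags as below>
>>   make install
>>   ulimit -c unlimited
>>   mpirun -np 2 ./mpi_test     # reproduce the segfault
>>   gdb ./mpi_test core         # then "bt" for the backtrace
>> )
>> 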
>>     Could someone also explain why these parameters need to
>> be set explicitly for RoCE rather than being embedded in the code?
>> 
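>> (Until then, to avoid retyping them on every run, the same value can
>> go in an MCA parameter file, e.g. $HOME/.openmpi/mca-params.conf:
>> 
>>   btl_openib_receive_queues = P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
>> 
>> assuming the file-based setting behaves the same as the command-line
>> one here.)
>> 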
>>                   Dave
>> 
>> This is the configure line that our package manager generates:
>> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu
>> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
>> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
>> --localstatedir=/var/lib --disable-dependency-tracking
>> --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1
>> --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64
>> --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace
>> --enable-orterun-prefix-by-default --with-hwloc=/usr
>> --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx
>> --without-cma --with-cuda=/opt/cuda --disable-io-romio
>> --disable-heterogeneous --enable-ipv6 --disable-java
>> --disable-mpi-java --disable-mpi-thread-multiple --without-verbs
>> --without-knem --without-psm --disable-openib-control-hdr-padding
>> --disable-openib-connectx-xrc --disable-openib-rdmacm
>> --disable-openib-udcm --disable-openib-dynamic-sl
>> --disable-btl-openib-failover --without-tm --without-slurm --with-sge
>> --enable-openib-connectx-xrc --enable-openib-rdmacm
>> --enable-openib-udcm --enable-openib-dynamic-sl
>> --enable-btl-openib-failover --with-verbs
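>> 
>> (Collapsing the duplicates, and assuming configure's usual behavior
>> of taking the last occurrence of a repeated option, the effective
>> settings for the repeated flags are:
>> 
>>   --sysconfdir=/etc/openmpi --with-verbs
>>   --enable-openib-connectx-xrc --enable-openib-rdmacm
>>   --enable-openib-udcm --enable-openib-dynamic-sl
>>   --enable-btl-openib-failover
>> 
>> so the later --enable-* options are the ones that actually apply.)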
>> 
>> On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard <hpprit...@gmail.com> 
>> wrote:
>> Hi Dave,
>> 
>> Sorry for the delayed response.  
>> 
>> Anyway, you have to use rdmacm for connection management when using RoCE.
>> However, with 2.0.1 and later, you have to specify per-peer QP info manually
>> on the mpirun command line.
>> 
>> Could you try rerunning with
>> 
>> mpirun --mca btl_openib_receive_queues \
>>     P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 \
>>     (all the rest of the command line args)
>> 
>> and see if it then works?
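>> 
>> (A rough decoding of that value, as I understand the format: each
>> colon-separated entry defines one receive queue, where "P" is a
>> per-peer queue pair and "S" is a shared receive queue, and the first
>> number in each entry is the buffer size in bytes.  The rdmacm CPC
>> needs the first queue to be a per-peer one, which the default
>> shared-queue-only value doesn't satisfy; hence setting it by hand
>> for RoCE.)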
>> 
>> Howard
>> 
>> 
>> 2017-01-04 16:37 GMT-07:00 Dave Turner <drdavetur...@gmail.com>:
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port.  As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>> 
>>  Local host:           elf22
>>  Local device:         mlx4_2
>>  Local port:           1
>>  CPCs attempted:       rdmacm, udcm
>> --------------------------------------------------------------------------
>> 
>>    I posted this to the user list but got no answer, so I'm reposting to
>> the devel list.
>> 
>>    We recently upgraded to OpenMPI 2.0.1.  Everything works fine
>> on our QDR connections, but we get the error above for our
>> 40 GbE connections running RoCE.  I traced through the code and
>> it looks like udcm cannot be used with RoCE.  I've also read that
>> there are currently some problems with rdmacm under 2.0.1, which
>> would mean 2.0.1 does not currently work on RoCE.  We've tested
>> 1.10.4 using rdmacm and that works fine, so I don't think we have
>> anything configured wrong on the RoCE side.
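>> 
>> (For what it's worth, we've been forcing the connection manager
>> explicitly while testing, along the lines of
>> 
>>   mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...
>> 
>> assuming btl_openib_cpc_include is still the right knob in 2.0.x.)
>> 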
>>     Could someone please verify that this is correct: that RoCE
>> requires rdmacm (not udcm), and that rdmacm is currently not working?
>> If so, is it being worked on?
>> 
>>                     Dave
>> 
>> 
>> -- 
>> Work:     davetur...@ksu.edu     (785) 532-7791
>>             2219 Engineering Hall, Manhattan KS  66506
>> Home:    drdavetur...@gmail.com
>>              cell: (785) 770-5929
>> 


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
