I checked:

1. The --enable-btl-openib-failover option existed in v2.0.1, but it no longer exists in the soon-to-be-released v2.0.2.
2. Here's where we removed it: https://github.com/open-mpi/ompi/pull/2350
There's no rationale listed on that PR, but the reason is that it's stale and no longer works.

Sorry Dave. :-\

> On Jan 12, 2017, at 6:54 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Did we just recently discuss the openib BTL failover capability and decide
> that it had bit-rotted?
>
> If so, we need to amend our documentation and disable the code.
>
>
>> On Jan 11, 2017, at 3:11 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>>
>>
>> The btl_openib_receive_queues parameters that Howard provided
>> fixed our problem with getting 2.0.1 working with RoCE so thanks for
>> all the help. However, we are seeing segfaults with this when
>> configured with --enable-btl-openib-failover. I've included the
>> configuration below that the package manager uses under Gentoo.
>> I also tested this after removing all of the redundant enable/disables,
>> and it's definitely the --enable-btl-openib-failover that causes 2.0.1
>> on RoCE to segfault. I can enable debugging and recompile if more
>> information is needed.
>>
>> Could someone also explain why these parameters need to
>> be set explicitly for RoCE rather than being embedded in the code?
>>
>> Dave
>>
>> This is the configure line that our package manager generates:
>> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu
>> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
>> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
>> --localstatedir=/var/lib --disable-dependency-tracking
>> --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1
>> --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64
>> --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace
>> --enable-orterun-prefix-by-default --with-hwloc=/usr
>> --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx
>> --without-cma --with-cuda=/opt/cuda --disable-io-romio
>> --disable-heterogeneous --enable-ipv6 --disable-java
>> --disable-mpi-java --disable-mpi-thread-multiple --without-verbs
>> --without-knem --without-psm --disable-openib-control-hdr-padding
>> --disable-openib-connectx-xrc --disable-openib-rdmacm
>> --disable-openib-udcm --disable-openib-dynamic-sl
>> --disable-btl-openib-failover --without-tm --without-slurm --with-sge
>> --enable-openib-connectx-xrc --enable-openib-rdmacm
>> --enable-openib-udcm --enable-openib-dynamic-sl
>> --enable-btl-openib-failover --with-verbs
>>
>> On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>> Hi Dave,
>>
>> Sorry for the delayed response.
>>
>> Anyway, you have to use rdmacm for connection management when using ROCE.
>> However, with 2.0.1 and later, you have to specify per peer QP info manually
>> on the mpirun command line.
>>
>> Could you try rerunning with
>>
>> mpirun --mca btl_openib_receive_queues
>> P,128,64,32,32,32:S,2048,1024,128,32:S,
>> 12288,1024,128,32:S,65536,1024,128,32 (all the rest of the command line
>> args)
>>
>> and see if it then works?
>>
>> Howard
>>
>>
>> 2017-01-04 16:37 GMT-07:00 Dave Turner <drdavetur...@gmail.com>:
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: elf22
>> Local device: mlx4_2
>> Local port: 1
>> CPCs attempted: rdmacm, udcm
>> --------------------------------------------------------------------------
>>
>> I posted this to the user list but got no answer so I'm reposting to
>> the devel list.
>>
>> We recently upgraded to OpenMPI 2.0.1. Everything works fine
>> on our QDR connections but we get the error above for our
>> 40 GbE connections running RoCE. I traced through the code and
>> it looks like udcm cannot be used with RoCE. I've also read that
>> there are currently some problems with rdmacm under 2.0.1, which
>> would mean 2.0.1 does not currently work on RoCE. We've tested
>> 10.4 using rdmacm and that works fine so I don't think we have anything
>> configured wrong on the RoCE side.
>> Could someone please verify whether this information is correct that
>> RoCE requires rdmacm only and not udcm, and that rdmacm is currently
>> not working. If so, is it being worked on?
>>
>> Dave
>>
>>
>> --
>> Work: davetur...@ksu.edu (785) 532-7791
>> 2219 Engineering Hall, Manhattan KS 66506
>> Home: drdavetur...@gmail.com
>> cell: (785) 770-5929
>>
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Jeff Squyres
jsquy...@cisco.com
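
For anyone retrying Howard's suggestion from the quoted thread: the mail client wrapped the btl_openib_receive_queues value across lines, and it presumably has to reach mpirun as a single colon-separated argument with no embedded whitespace. The sketch below only assembles that value and prints the resulting command; it does not run mpirun. NPROCS and APP are hypothetical placeholders that are not part of the original message, and the queue values are copied from Howard's mail as-is rather than tuned for any particular system.

```shell
#!/bin/sh
# Queue specifications copied from Howard's message, appended one at a time
# so that no stray whitespace from mail line-wrapping ends up in the value.
RECEIVE_QUEUES="P,128,64,32,32,32"
RECEIVE_QUEUES="${RECEIVE_QUEUES}:S,2048,1024,128,32"
RECEIVE_QUEUES="${RECEIVE_QUEUES}:S,12288,1024,128,32"
RECEIVE_QUEUES="${RECEIVE_QUEUES}:S,65536,1024,128,32"

# Hypothetical placeholders (assumptions, not from the thread):
# substitute your own process count and MPI binary.
NPROCS=2
APP=./my_mpi_app

# Print the command instead of executing it, so the sketch can be checked
# on a machine without Open MPI installed.
echo mpirun --mca btl_openib_receive_queues "$RECEIVE_QUEUES" -np "$NPROCS" "$APP"
```

Once the printed command looks right, dropping the leading echo would run it; quoting "$RECEIVE_QUEUES" keeps the value as one shell word.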