That certainly addresses part of the problem. I am working on a complete revamp of the btl RDMA interface. It contains this fix:
https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316

-Nathan

On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
> I found the bug. Here is the fix:
>
> [root@stevo1 openib]# git diff
> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
> index d876e21..8a5ea82 100644
> --- a/opal/mca/btl/openib/btl_openib_component.c
> +++ b/opal/mca/btl/openib/btl_openib_component.c
> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>      }
>
>      /* If the MCA param was specified, skip all the checks */
> -    if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
> -         MCA_BASE_VAR_SOURCE_ENV ==
> -         mca_btl_openib_component.receive_queues_source) {
> +    if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
> +        MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>          goto good;
>      }
>
> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
> >I have run into the issue as well. I will open a pull request for 1.8.4
> >as part of a patch fixing the coalescing issues.
> >
> >-Nathan
> >
> >On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
> >>On 11/4/2014 2:09 PM, Steve Wise wrote:
> >>>Hi,
> >>>
> >>>I'm running ompi top-of-tree from github and seeing an openib btl issue
> >>>where the qp/srq configuration is incorrect for the given device id. This
> >>>works fine in 1.8.4rc1, but I see the problem in top-of-tree. A simple
> >>>two-node IMB-MPI1 pingpong fails to get the ranks set up. I see this logged:
> >>>
> >>>/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2 --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
> >>>
> >>Adding this works around the issue:
> >>
> >>--mca btl_openib_receive_queues P,65536,64
> >>
> >>I also confirmed that opal_btl_openib_ini_query() is getting the correct
> >>receive_queues string from the .ini file on both nodes for the cxgb4
> >>device...
> >>
> >>><snip>
> >>>
> >>>--------------------------------------------------------------------------
> >>>The Open MPI receive queue configuration for the OpenFabrics devices
> >>>on two nodes are incompatible, meaning that MPI processes on two
> >>>specific nodes were unable to communicate with each other. This
> >>>generally happens when you are using OpenFabrics devices from
> >>>different vendors on the same network. You should be able to use the
> >>>mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> >>>queue configuration for all the devices in the MPI job, and therefore
> >>>be able to run successfully.
> >>>
> >>>  Local host:     stevo2
> >>>  Local adapter:  cxgb4_0 (vendor 0x1425, part ID 21520)
> >>>  Local queues:   P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> >>>
> >>>  Remote host:    stevo1
> >>>  Remote adapter: (vendor 0x1425, part ID 21520)
> >>>  Remote queues:  P,65536,64
> >>>--------------------------------------------------------------------------
> >>>
> >>>The stevo1 rank has the correct queue settings: P,65536,64. For some
> >>>reason, stevo2 has the wrong settings, even though it has the correct
> >>>device id info.
> >>>
> >>>Any suggestions on debugging this? Like where to dig in the src to see if
> >>>somehow the .ini parsing is broken...
> >>>
> >>>Thanks,
> >>>
> >>>Steve.
> >>>_______________________________________________
> >>>devel mailing list
> >>>de...@open-mpi.org
> >>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16179.php
> >>_______________________________________________
> >>devel mailing list
> >>de...@open-mpi.org
> >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16180.php
> >>
> >>_______________________________________________
> >>devel mailing list
> >>de...@open-mpi.org
> >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16181.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16182.php