There is one other bug fix to address the message coalescing bug. The rest is the BTL RDMA revamp.
If there is a need I can probably pull those out and apply them to
master sooner than SC.

-Nathan

On Tue, Nov 04, 2014 at 10:11:26PM +0000, Jeff Squyres (jsquyres) wrote:
> It sounds like this fix should be merged in soon.
>
> Nathan: are your other changes bug fixes, or part of your BTL revamp branch?
>
>
> On Nov 4, 2014, at 5:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>
> > Ok, sounds like I should let you continue the good work! :) When do you
> > plan to merge this into ompi proper?
> >
> >
> > On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
> >> That certainly addresses part of the problem. I am working on a complete
> >> revamp of the btl RDMA interface. It contains this fix:
> >>
> >> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
> >>
> >> -Nathan
> >>
> >> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
> >>
> >>> I found the bug. Here is the fix:
> >>>
> >>> [root@stevo1 openib]# git diff
> >>> diff --git a/opal/mca/btl/openib/btl_openib_component.c
> >>> b/opal/mca/btl/openib/btl_openib_component.c
> >>> index d876e21..8a5ea82 100644
> >>> --- a/opal/mca/btl/openib/btl_openib_component.c
> >>> +++ b/opal/mca/btl/openib/btl_openib_component.c
> >>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list,
> >>> struct ibv_device* ib_dev)
> >>> }
> >>>
> >>> /* If the MCA param was specified, skip all the checks */
> >>> - if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
> >>> - MCA_BASE_VAR_SOURCE_ENV ==
> >>> - mca_btl_openib_component.receive_queues_source) {
> >>> + if (MCA_BASE_VAR_SOURCE_COMMAND_LINE ==
> >>> mca_btl_openib_component.receive_queues_source||
> >>> + MCA_BASE_VAR_SOURCE_ENV ==
> >>> mca_btl_openib_component.receive_queues_source) {
> >>> goto good;
> >>> }
> >>>
> >>>
> >>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
> >>>
> >>>> I have run into the issue as well. I will open a pull request for 1.8.4
> >>>> as part of a patch fixing the coalescing issues.
> >>>>
> >>>> -Nathan
> >>>>
> >>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
> >>>>
> >>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm running ompi top-o-tree from github and seeing an openib btl issue
> >>>>>> where the qp/srq configuration is incorrect for the given device id. This
> >>>>>> works fine in 1.8.4rc1, but I see the problem in top-of-tree. A simple 2
> >>>>>> node IMB-MPI1 pingpong fails to get the ranks setup. I see this logged:
> >>>>>>
> >>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
> >>>>>> --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
> >>>>>>
> >>>>>
> >>>>> Adding this works around the issue:
> >>>>>
> >>>>> --mca btl_openib_receive_queues P,65536,64
> >>>>>
> >>>>> I also confirmed that opal_btl_openib_ini_query() is getting the correct
> >>>>> receive_queues string from the .ini file on both nodes for the cxgb4
> >>>>> device...
> >>>>>
> >>>>>> <snip>
> >>>>>>
> >>>>>> --------------------------------------------------------------------------
> >>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
> >>>>>> on two nodes are incompatible, meaning that MPI processes on two
> >>>>>> specific nodes were unable to communicate with each other. This
> >>>>>> generally happens when you are using OpenFabrics devices from
> >>>>>> different vendors on the same network. You should be able to use the
> >>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> >>>>>> queue configuration for all the devices in the MPI job, and therefore
> >>>>>> be able to run successfully.
> >>>>>>
> >>>>>> Local host: stevo2
> >>>>>> Local adapter: cxgb4_0 (vendor 0x1425, part ID 21520)
> >>>>>> Local queues:
> >>>>>> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> >>>>>>
> >>>>>> Remote host: stevo1
> >>>>>> Remote adapter: (vendor 0x1425, part ID 21520)
> >>>>>> Remote queues: P,65536,64
> >>>>>> ----------------------------------------------------------------------------
> >>>>>>
> >>>>>> The stevo1 rank has the correct queue settings: P,65536,64. For some
> >>>>>> reason, stevo2 has the wrong settings, even though it has the correct
> >>>>>> device id info.
> >>>>>>
> >>>>>> Any suggestions on debugging this? Like where to dig in the src to see if
> >>>>>> somehow the .ini parsing is broken...
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Steve.
> >>>>>> _______________________________________________
> >>>>>> devel mailing list
> >>>>>> de...@open-mpi.org
> >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>>> Link to this post:
> >>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16179.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
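
For anyone following along, the condition fixed in the quoted diff can be reproduced in isolation. The sketch below is illustrative only: `var_source_t`, its enumerators, and the two helper functions are hypothetical stand-ins for the MCA_BASE_VAR_SOURCE_* values and the check in init_one_device(), not Open MPI's actual definitions. It assumes only that the command-line constant is nonzero, which would make the original test true for every source.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MCA variable-source enum. The real
 * values live in Open MPI's MCA base headers and are not copied here;
 * the only assumption is that the command-line value is nonzero. */
typedef enum {
    SRC_DEFAULT = 0,
    SRC_COMMAND_LINE,
    SRC_ENV,
    SRC_FILE
} var_source_t;

/* Shape of the original test: the first operand of || is a bare
 * nonzero constant, so the whole expression is true for any src. */
static bool user_set_buggy(var_source_t src)
{
    return SRC_COMMAND_LINE || SRC_ENV == src;
}

/* Shape of the fixed test: each constant is compared against src. */
static bool user_set_fixed(var_source_t src)
{
    return SRC_COMMAND_LINE == src || SRC_ENV == src;
}
```

With these definitions, user_set_buggy() returns true even for SRC_DEFAULT and SRC_FILE, so a caller would skip its checks unconditionally, while user_set_fixed() returns true only for SRC_COMMAND_LINE and SRC_ENV.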