There is one other bug fix to address the message coalescing bug. The
rest is the BTL RDMA revamp.

If there is a need I can probably pull those out and apply them to
master sooner than SC.

-Nathan

On Tue, Nov 04, 2014 at 10:11:26PM +0000, Jeff Squyres (jsquyres) wrote:
> It sounds like this fix should be merged in soon.
> 
> Nathan: are your other changes bug fixes, or part of your BTL revamp branch?
> 
> 
> On Nov 4, 2014, at 5:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
> 
> > Ok, sounds like I should let you continue the good work! :)  When do you 
> > plan to merge this into ompi proper?
> > 
> > 
> > On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
> >> That certainly addresses part of the problem. I am working on a complete
> >> revamp of the btl RDMA interface. It contains this fix:
> >> 
> >> 
> >> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
> >> 
> >> 
> >> -Nathan
> >> 
> >> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
> >> 
> >>> I found the bug.  Here is the fix:
> >>> 
> >>> [root@stevo1 openib]# git diff
> >>> diff --git a/opal/mca/btl/openib/btl_openib_component.c
> >>> b/opal/mca/btl/openib/btl_openib_component.c
> >>> index d876e21..8a5ea82 100644
> >>> --- a/opal/mca/btl/openib/btl_openib_component.c
> >>> +++ b/opal/mca/btl/openib/btl_openib_component.c
> >>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list,
> >>> struct ibv_device* ib_dev)
> >>>          }
> >>> 
> >>>          /* If the MCA param was specified, skip all the checks */
> >>> -        if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
> >>> -                MCA_BASE_VAR_SOURCE_ENV ==
> >>> -            mca_btl_openib_component.receive_queues_source) {
> >>> +        if (MCA_BASE_VAR_SOURCE_COMMAND_LINE ==
> >>> mca_btl_openib_component.receive_queues_source||
> >>> +            MCA_BASE_VAR_SOURCE_ENV ==
> >>> mca_btl_openib_component.receive_queues_source) {
> >>>              goto good;
> >>>          }
> >>> 
> >>> 
> >>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
> >>> 
> >>>> I have run into the issue as well. I will open a pull request for 1.8.4
> >>>> as part of a patch fixing the coalescing issues.
> >>>> 
> >>>> -Nathan
> >>>> 
> >>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
> >>>> 
> >>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
> >>>>> 
> >>>>>> Hi,
> >>>>>> 
> >>>>>> I'm running ompi top-o-tree from github and seeing an openib btl issue
> >>>>>> where the qp/srq configuration is incorrect for the given device id.  
> >>>>>> This
> >>>>>> works fine in 1.8.4rc1, but I see the problem in top-of-tree.  A 
> >>>>>> simple 2
> >>>>>> node IMB-MPI1 pingpong fails to get the ranks setup.  I see this 
> >>>>>> logged:
> >>>>>> 
> >>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host 
> >>>>>> stevo1,stevo2
> >>>>>> --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
> >>>>>> 
> >>>>>> 
> >>>>> Adding this works around the issue:
> >>>>> 
> >>>>> --mca btl_openib_receive_queues P,65536,64
> >>>>> 
> >>>>> I also confirmed that opal_btl_openib_ini_query() is getting the correct
> >>>>> receive_queues string from the .ini file on both nodes for the cxgb4
> >>>>> device...
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> <snip>
> >>>>>> 
> >>>>>> --------------------------------------------------------------------------
> >>>>>> 
> >>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
> >>>>>> on two nodes are incompatible, meaning that MPI processes on two
> >>>>>> specific nodes were unable to communicate with each other.  This
> >>>>>> generally happens when you are using OpenFabrics devices from
> >>>>>> different vendors on the same network.  You should be able to use the
> >>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> >>>>>> queue configuration for all the devices in the MPI job, and therefore
> >>>>>> be able to run successfully.
> >>>>>> 
> >>>>>>  Local host:       stevo2
> >>>>>>  Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21520)
> >>>>>>  Local queues: 
> >>>>>> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> >>>>>> 
> >>>>>>  Remote host:      stevo1
> >>>>>>  Remote adapter:   (vendor 0x1425, part ID 21520)
> >>>>>>  Remote queues:    P,65536,64
> >>>>>> ----------------------------------------------------------------------------
> >>>>>> 
> >>>>>> 
> >>>>>> The stevo1 rank has the correct queue settings: P,65536,64.  For some
> >>>>>> reason, stevo2 has the wrong settings, even though it has the correct
> >>>>>> device id info.
> >>>>>> 
> >>>>>> Any suggestions on debugging this?  Like where to dig in the src to 
> >>>>>> see if
> >>>>>> somehow the .ini parsing is broken...
> >>>>>> 
> >>>>>> 
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Steve.
> >>>>>> _______________________________________________
> >>>>>> devel mailing list
> >>>>>> 
> >>>>>> de...@open-mpi.org
> >>>>>> 
> >>>>>> Subscription: 
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>>> 
> >>>>>> Link to this post:
> >>>>>> 
> >>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16179.php
> >>>>> _______________________________________________
> >>>>> devel mailing list
> >>>>> 
> >>>>> de...@open-mpi.org
> >>>>> 
> >>>>> Subscription: 
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>> 
> >>>>> Link to this post: 
> >>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16180.php
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> devel mailing list
> >>>>> 
> >>>>> de...@open-mpi.org
> >>>>> 
> >>>>> Subscription: 
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>> 
> >>>>> Link to this post: 
> >>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16181.php
> >>> _______________________________________________
> >>> devel mailing list
> >>> 
> >>> de...@open-mpi.org
> >>> 
> >>> Subscription: 
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> 
> >>> Link to this post: 
> >>> http://www.open-mpi.org/community/lists/devel/2014/11/16182.php
> >>> 
> >>> 
> >>> _______________________________________________
> >>> devel mailing list
> >>> 
> >>> de...@open-mpi.org
> >>> 
> >>> Subscription: 
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> 
> >>> Link to this post: 
> >>> http://www.open-mpi.org/community/lists/devel/2014/11/16184.php
> > 
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/11/16185.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16186.php

Attachment: pgp9afLT34JIj.pgp
Description: PGP signature

Reply via email to