That certainly addresses part of the problem. I am working on a complete
revamp of the btl RDMA interface. It contains this fix:

https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316

-Nathan

On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
> I found the bug.  Here is the fix:
> 
> [root@stevo1 openib]# git diff
> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
> index d876e21..8a5ea82 100644
> --- a/opal/mca/btl/openib/btl_openib_component.c
> +++ b/opal/mca/btl/openib/btl_openib_component.c
> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>          }
> 
>          /* If the MCA param was specified, skip all the checks */
> -        if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
> -                MCA_BASE_VAR_SOURCE_ENV ==
> -            mca_btl_openib_component.receive_queues_source) {
> +        if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
> +            MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>              goto good;
>          }
> 
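> For context, here is a minimal standalone sketch of why the original
> condition was always true (the enum values below are hypothetical
> stand-ins, not the actual OPAL definitions):
> 
>     /* sketch.c: in the buggy form the first operand of || is the bare
>      * enum constant, which is nonzero, so the whole condition is always
>      * true and the device-compatibility checks are always skipped. */
>     #include <stdio.h>
> 
>     typedef enum {
>         MCA_BASE_VAR_SOURCE_DEFAULT,       /* 0: param not set by user */
>         MCA_BASE_VAR_SOURCE_COMMAND_LINE,  /* 1 (assumed nonzero)      */
>         MCA_BASE_VAR_SOURCE_ENV            /* 2 (assumed nonzero)      */
>     } var_source_t;
> 
>     int main(void)
>     {
>         var_source_t source = MCA_BASE_VAR_SOURCE_DEFAULT;
> 
>         /* Buggy: always true, regardless of source. */
>         if (MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
>             MCA_BASE_VAR_SOURCE_ENV == source) {
>             printf("buggy: skipped the checks\n");
>         }
> 
>         /* Fixed: true only when the user actually set the param. */
>         if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == source ||
>             MCA_BASE_VAR_SOURCE_ENV == source) {
>             printf("fixed: skipped the checks\n");
>         } else {
>             printf("fixed: ran the checks\n");
>         }
>         return 0;
>     }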
> 
> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
> >I have run into the issue as well. I will open a pull request for 1.8.4
> >as part of a patch fixing the coalescing issues.
> >
> >-Nathan
> >
> >On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
> >>On 11/4/2014 2:09 PM, Steve Wise wrote:
> >>>Hi,
> >>>
> >>>I'm running ompi top-of-tree from github and seeing an openib btl issue
> >>>where the qp/srq configuration is incorrect for the given device id.  This
> >>>works fine in 1.8.4rc1, but I see the problem in top-of-tree.  A simple
> >>>2-node IMB-MPI1 pingpong fails to get the ranks set up.  I see this logged:
> >>>
> >>>/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
> >>>--mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
> >>>
> >>Adding this works around the issue:
> >>
> >>--mca btl_openib_receive_queues P,65536,64
> >>
> >>I also confirmed that opal_btl_openib_ini_query() is getting the correct
> >>receive_queues string from the .ini file on both nodes for the cxgb4
> >>device...
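> >>
> >>For reference, a receive_queues value is a colon-separated list of
> >>queues, with comma-separated fields inside each queue (the first field
> >>is the queue type: P = per-peer, S = shared receive queue).  A minimal
> >>sketch of that split, purely illustrative and not the btl's parser:
> >>
> >>    /* split.c: break a receive_queues spec into individual queues. */
> >>    #include <stdio.h>
> >>    #include <string.h>
> >>
> >>    int main(void)
> >>    {
> >>        char spec[] = "P,128,256,192,128:S,2048,1024,1008,64";
> >>        for (char *q = strtok(spec, ":"); q != NULL;
> >>             q = strtok(NULL, ":")) {
> >>            printf("queue: %s\n", q);  /* e.g. "P,128,256,192,128" */
> >>        }
> >>        return 0;
> >>    }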
> >>
> >>
> >>><snip>
> >>>
> >>>--------------------------------------------------------------------------
> >>>
> >>>The Open MPI receive queue configuration for the OpenFabrics devices
> >>>on two nodes are incompatible, meaning that MPI processes on two
> >>>specific nodes were unable to communicate with each other.  This
> >>>generally happens when you are using OpenFabrics devices from
> >>>different vendors on the same network.  You should be able to use the
> >>>mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> >>>queue configuration for all the devices in the MPI job, and therefore
> >>>be able to run successfully.
> >>>
> >>>  Local host:       stevo2
> >>>  Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21520)
> >>>  Local queues:     P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> >>>
> >>>  Remote host:      stevo1
> >>>  Remote adapter:   (vendor 0x1425, part ID 21520)
> >>>  Remote queues:    P,65536,64
> >>>----------------------------------------------------------------------------
> >>>
> >>>
> >>>The stevo1 rank has the correct queue settings: P,65536,64.  For some
> >>>reason, stevo2 has the wrong settings, even though it has the correct
> >>>device id info.
> >>>
> >>>Any suggestions on debugging this?  Like where to dig in the src to see if
> >>>somehow the .ini parsing is broken...
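> >>>
> >>>For reference, the per-device stanza in the openib .ini file
> >>>(mca-btl-openib-device-params.ini) looks roughly like this; the exact
> >>>section name and key values for cxgb4 are an assumption based on the
> >>>vendor/part IDs reported above:
> >>>
> >>>    [Chelsio T5]
> >>>    vendor_id = 0x1425
> >>>    vendor_part_id = 21520
> >>>    receive_queues = P,65536,64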
> >>>
> >>>
> >>>Thanks,
> >>>
> >>>Steve.
