I have run into the issue as well. I will open a pull request for 1.8.4 as part of a patch fixing the coalescing issues.
-Nathan

On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
> On 11/4/2014 2:09 PM, Steve Wise wrote:
> >Hi,
> >
> >I'm running ompi top-o-tree from github and seeing an openib btl issue
> >where the qp/srq configuration is incorrect for the given device id. This
> >works fine in 1.8.4rc1, but I see the problem in top-of-tree. A simple 2
> >node IMB-MPI1 pingpong fails to get the ranks setup. I see this logged:
> >
> >/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
> >--mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
> >
>
> Adding this works around the issue:
>
> --mca btl_openib_receive_queues P,65536,64
>
> I also confirmed that opal_btl_openib_ini_query() is getting the correct
> receive_queues string from the .ini file on both nodes for the cxgb4
> device...
>
> >
> ><snip>
> >
> >--------------------------------------------------------------------------
> >
> >The Open MPI receive queue configuration for the OpenFabrics devices
> >on two nodes are incompatible, meaning that MPI processes on two
> >specific nodes were unable to communicate with each other. This
> >generally happens when you are using OpenFabrics devices from
> >different vendors on the same network. You should be able to use the
> >mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> >queue configuration for all the devices in the MPI job, and therefore
> >be able to run successfully.
> >
> > Local host: stevo2
> > Local adapter: cxgb4_0 (vendor 0x1425, part ID 21520)
> > Local queues:
> > P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
> >
> > Remote host: stevo1
> > Remote adapter: (vendor 0x1425, part ID 21520)
> > Remote queues: P,65536,64
> >----------------------------------------------------------------------------
> >
> >
> >The stevo1 rank has the correct queue settings: P,65536,64. For some
> >reason, stevo2 has the wrong settings, even though it has the correct
> >device id info.
> >
> >Any suggestions on debugging this? Like where to dig in the src to see if
> >somehow the .ini parsing is broken...
> >
> >
> >Thanks,
> >
> >Steve.
> >_______________________________________________
> >devel mailing list
> >de...@open-mpi.org
> >Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >Link to this post:
> >http://www.open-mpi.org/community/lists/devel/2014/11/16179.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16180.php
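
For anyone comparing the two queue strings from the error above by hand, here is a minimal Python sketch that splits a receive_queues value into its colon-separated queue specs and checks whether two hosts report the same list. The function names parse_receive_queues and queues_match are hypothetical helpers invented for this note. This is not how the openib BTL compares configurations internally, only a textual comparison of the strings that the error message prints.

```python
from typing import List, Tuple

# One queue spec: the queue type letter plus its numeric fields, in order.
QueueSpec = Tuple[str, Tuple[int, ...]]


def parse_receive_queues(value: str) -> List[QueueSpec]:
    """Split a receive_queues string into (type, numeric fields) per queue.

    Hypothetical helper: it only tokenizes the string and does not
    validate field counts or ranges the way Open MPI itself would.
    """
    queues = []
    for spec in value.strip().split(":"):
        kind, *fields = spec.split(",")
        queues.append((kind.strip(), tuple(int(f) for f in fields)))
    return queues


def queues_match(local: str, remote: str) -> bool:
    """Return True when both strings describe the same queue list."""
    return parse_receive_queues(local) == parse_receive_queues(remote)


# Values copied from the error message in this thread.
local = ("P,128,256,192,128:S,2048,1024,1008,64:"
         "S,12288,1024,1008,64:S,65536,1024,1008,64")
remote = "P,65536,64"

if __name__ == "__main__":
    for index, (kind, fields) in enumerate(parse_receive_queues(local)):
        print(f"stevo2 queue {index}: type={kind} fields={fields}")
    print("configurations match:", queues_match(local, remote))
```

With the strings from this report, stevo2 advertises four queues and stevo1 a single P queue, so the comparison reports a mismatch, which is consistent with the error text.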