This is the email thread which sparked the problem:

http://www.open-mpi.org/community/lists/devel/2014/07/15329.php

I actually tried to apply the original CMR to the 1.8 branch and couldn't get it to 
work; I just kept having problems, so I pushed it off to 1.8.3. I'm leery of accepting 
either of the current CMRs for two reasons: (a) none of the preceding changes is in 
the 1.8 series yet, and (b) it doesn't sound like we have a complete solution yet.

Anyway, I just wanted to point to the original problem that these changes were 
trying to address.


On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Howard and Edgar,
> 
> I fixed a few bugs (r32639 and r32642).
> 
> The bug is trivial to reproduce with any MPI hello world program:
> 
> mpirun -np 2 --mca btl openib,self hello_world
> 
> after setting the MCA param in $HOME/.openmpi/mca-params.conf:
> 
> $ cat ~/.openmpi/mca-params.conf
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
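> 
> For reference, any minimal MPI hello world will do; something along these lines
> is enough, since the failure happens inside MPI_Init itself (see Edgar's stack
> trace further down):
> 
> /* hello_world.c - compile with: mpicc hello_world.c -o hello_world */
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[])
> {
>     int rank, size;
> 
>     MPI_Init(&argc, &argv);    /* the crash in the stack trace below happens inside this call */
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("hello from rank %d of %d\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }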
> 
> The good news is that the program no longer crashes with a glorious SIGSEGV;
> the bad news is that it now (nicely) aborts for an incorrect reason:
> 
> --------------------------------------------------------------------------
> The Open MPI receive queue configuration for the OpenFabrics devices
> on two nodes are incompatible, meaning that MPI processes on two
> specific nodes were unable to communicate with each other.  This
> generally happens when you are using OpenFabrics devices from
> different vendors on the same network.  You should be able to use the
> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> queue configuration for all the devices in the MPI job, and therefore
> be able to run successfully.
> 
>  Local host:       node0
>  Local adapter:    mlx4_0 (vendor 0x2c9, part ID 4099)
>  Local queues:     S,12288,128,64,32:S,65536,128,64,3
> 
>  Remote host:      node0
>  Remote adapter:   (vendor 0x2c9, part ID 4099)
>  Remote queues:    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
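> 
> As a reminder of what is being compared here: the value is a colon-separated
> list of queue specifications, and each specification is a comma-separated list
> whose first field is the queue type (P for a per-peer queue, S for a shared
> receive queue), followed by the buffer size and queue depth parameters. A
> standalone sketch, just to show how the string breaks down (this is not the
> openib component's actual parser):
> 
> /* rq_format.c - split a btl_openib_receive_queues value into its parts */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> static void dump_receive_queues(const char *value)
> {
>     char *copy = strdup(value);            /* strtok_r modifies its input */
>     char *save_spec = NULL;
> 
>     for (char *spec = strtok_r(copy, ":", &save_spec); NULL != spec;
>          spec = strtok_r(NULL, ":", &save_spec)) {
>         char *save_field = NULL;
>         char *field = strtok_r(spec, ",", &save_field);   /* queue type: P or S */
>         printf("queue type %s, parameters:", field);
>         while (NULL != (field = strtok_r(NULL, ",", &save_field))) {
>             printf(" %s", field);
>         }
>         printf("\n");
>     }
>     free(copy);
> }
> 
> int main(void)
> {
>     dump_receive_queues("S,12288,128,64,32:S,65536,128,64,3");
>     return 0;
> }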
> 
> The root cause is that the remote host did not send its receive_queues value to
> the local host, and hence the local host believes the remote host uses the
> default value.
> 
> The logic was revamped on trunk relative to v1.8, which is why v1.8 does not
> have this issue.
> 
> I am still thinking about what the right fix should be:
> - one option is to send the receive queues
> - another option would be to differentiate a value overridden in
> mca-params.conf (which should always be OK) from a value overridden in the
> .ini file (where we might want to double check that the local and remote
> values match); a rough sketch of this second option follows below
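> 
> Just to make that second option concrete, here is a rough sketch with made-up
> names (rq_source and receive_queues_mismatch_is_fatal are hypothetical; a real
> fix would need to ask the MCA variable system where the local value actually
> came from):
> 
> #include <stdbool.h>
> #include <string.h>
> 
> /* hypothetical: where the local receive_queues value was set */
> enum rq_source { RQ_SOURCE_DEFAULT, RQ_SOURCE_INI, RQ_SOURCE_USER };
> 
> static bool receive_queues_mismatch_is_fatal(const char *local_qps,
>                                              const char *remote_qps,
>                                              enum rq_source local_source)
> {
>     if (NULL == remote_qps) {
>         /* the remote side did not publish a value: nothing to compare yet */
>         return false;
>     }
>     if (RQ_SOURCE_USER == local_source) {
>         /* the user set it explicitly (mca-params.conf, command line): trust it */
>         return false;
>     }
>     /* the value came from the device .ini defaults: both sides should agree */
>     return 0 != strcmp(local_qps, remote_qps);
> }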
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
>> Hi Edgar,
>> 
>> Could you send me your conf file?  I'll try to reproduce it.
>> 
>> Maybe run with --mca btl_base_verbose 20 or something to
>> see what the code that parses this field in the conf file
>> is finding.
>> 
>> 
>> Howard
>> 
>> 
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
>> Sent: Thursday, August 28, 2014 3:40 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] segfault in openib component on trunk
>> 
>> To add another piece of information that I just found: the segfault only 
>> occurs if I have a particular MCA parameter set in my mca-params.conf file, 
>> namely
>> 
>> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>> 
>> Has the syntax for this parameter changed, or should/can I get rid of it?
>> 
>> Thanks
>> Edgar
>> 
>> On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
>>> We have recently been having problems running trunk with the openib 
>>> component enabled on one of our clusters. The problem occurs right in the 
>>> initialization phase; here is the stack right before the segfault:
>>> 
>>> ---snip---
>>> (gdb) where
>>> #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
>>> endpoint=0x7d9660) at btl_openib.c:470
>>> #1  0x00007f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
>>> nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
>>> btl_openib.c:1093
>>> #2  0x00007f106316102c in mca_bml_r2_add_procs (nprocs=2, 
>>> procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
>>> #3  0x00007f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
>>> nprocs=2) at pml_ob1.c:334
>>> #4  0x00007f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
>>> requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
>>> #5  0x00007f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
>>> argv=0x7fff22dd1880) at init.c:84
>>> #6  0x00000000004008e7 in main (argc=1, argv=0x7fff22dd1da8) at
>>> hello_world.c:13
>>> ---snip---
>>> 
>>> 
>>> In line 538 of the file containing the mca_btl_openib_tune_endpoint 
>>> routine, the strcmp operation fails because recv_qps is a NULL pointer.
>>> 
>>> 
>>> ---snip---
>>> 
>>> if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {
>>> 
>>> ---snip---
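>>> 
>>> Just as an illustration of the immediate symptom (not a proposed patch), a
>>> guard along these lines avoids handing a NULL pointer to strcmp; what the
>>> right behaviour is when the remote value is missing is a separate question:
>>> 
>>> ---snip---
>>> 
>>> /* treat a missing remote value as an empty string so strcmp() is safe */
>>> const char *remote_qps = (NULL != recv_qps) ? recv_qps : "";
>>> 
>>> if(0 != strcmp(mca_btl_openib_component.receive_queues, remote_qps)) {
>>> 
>>> ---snip---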
>>> 
>>> Does anybody have an idea of what might be going wrong and how to 
>>> resolve it? Just to confirm: everything works perfectly with the 1.8 
>>> series on that very same cluster.
>>> 
>>> Thanks
>>> Edgar