Hi Edgar,

Could you send me your conf file?  I'll try to reproduce it.

It may also help to run with --mca btl_base_verbose 20 or so, to
see what the code that parses this field in the conf file
is finding.
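
In the meantime, regarding the strcmp in the snippet quoted below: passing a
NULL pointer to strcmp is undefined behavior and typically segfaults, which
matches what you are seeing. Here is a minimal standalone C sketch of guarding
such a comparison. The helper name queues_differ is hypothetical and not taken
from btl_openib.c, and this is only an illustration of the failure mode, not a
proposed fix for the component.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if the two queue strings differ, 0 if they are identical,
 * and -1 if either pointer is NULL (instead of crashing inside strcmp). */
static int queues_differ(const char *configured, const char *recv_qps)
{
    if (NULL == configured || NULL == recv_qps) {
        return -1;
    }
    return 0 != strcmp(configured, recv_qps);
}
```

Calling queues_differ(value, NULL) returns -1 rather than dereferencing the
NULL pointer, so the caller can report the missing value instead of crashing.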


Howard


-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
Sent: Thursday, August 28, 2014 3:40 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] segfault in openib component on trunk

To add another piece of information that I just found: the segfault only occurs
if I have a particular MCA parameter set in my mca-params.conf file, namely

btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3

Has the syntax for this parameter changed, or should/can I get rid of it?
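
For what it's worth, the value above reads as two colon-separated queue
specifications with five comma-separated fields each. Here is a minimal
standalone C sketch that checks this shape. It is not the Open MPI parser, and
the helper name count_queue_fields is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Counts the colon-separated queue specs in `value` and stores the number of
 * comma-separated fields of each of the first `max` specs in `fields`.
 * Returns the number of specs found. */
static int count_queue_fields(const char *value, int *fields, int max)
{
    int nspecs = 0;
    const char *p = value;

    while (*p != '\0') {
        size_t len = strcspn(p, ":");   /* length of the current spec */
        int n = 1;                      /* a spec has at least one field */
        for (size_t i = 0; i < len; i++) {
            if (p[i] == ',') {
                n++;
            }
        }
        if (nspecs < max) {
            fields[nspecs] = n;
        }
        nspecs++;
        p += len;
        if (*p == ':') {
            p++;                        /* skip the separator */
        }
    }
    return nspecs;
}
```

This only confirms that the string splits as expected. It says nothing about
whether the individual field values are valid for the openib component.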

Thanks
Edgar

On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
> We have recently been having problems running trunk with the openib component
> enabled on one of our clusters. The problem occurs right in the
> initialization phase; here is the stack right before the segfault:
>
> ---snip---
> (gdb) where
> #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
> endpoint=0x7d9660) at btl_openib.c:470
> #1  0x00007f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
> nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
> btl_openib.c:1093
> #2  0x00007f106316102c in mca_bml_r2_add_procs (nprocs=2, 
> procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
> #3  0x00007f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
> nprocs=2) at pml_ob1.c:334
> #4  0x00007f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
> requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
> #5  0x00007f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
> argv=0x7fff22dd1880) at init.c:84
> #6  0x00000000004008e7 in main (argc=1, argv=0x7fff22dd1da8) at
> hello_world.c:13
> ---snip---
>
>
> In line 538 of the file containing the mca_btl_openib_tune_endpoint
> routine, the strcmp call segfaults because recv_qps is a NULL pointer.
>
>
> ---snip---
>
> if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {
>
> ---snip---
>
> Does anybody have an idea of what might be going wrong and how to
> resolve it? Just to confirm, everything works perfectly with the 1.8
> series on that very same cluster.
>
> Thanks
> Edgar
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15746.php

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/08/15747.php
