I am not seeing this. Maybe it is something exposed by the fact we actually 
call del_procs correctly now.  I will try to take a look over the weekend.

-Nathan

________________________________________
From: devel [devel-boun...@open-mpi.org] on behalf of Thomas Naughton 
[naught...@ornl.gov]
Sent: Friday, May 16, 2014 11:43 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] yesterday commits caused a crash in helloworld with 
--mca btl tcp, self

Hi,

I'm also seeing some sporadic failures with recent commits to trunk.
My tests use a slightly different build/configuration and a different
RTE, but the errors are coming from the OMPI ob1 layer.

  works: r31777   (I did not test r31778..r31783)
  fails: r31784M  (plus manually applied patch from r31786)

My test was something simple:
     cd examples/
     mpicc -g hello_c.c -o hello_c
     mpirun -np 10 hello_c

Again, it is sporadic. I was able to reproduce the failure with different
values of '-np' > 1; sometimes np=3, other times np=11.

Here's some backtrace / debug info...

Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
     btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
139         if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
     btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
#1  0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0)
     at bml_r2.c:551
#2  0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3  0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4  0xb7f5a755 in mca_pml_v_component_parasite_finalize ()
     at pml_v_component.c:161
#5  0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6  0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7  0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8  0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24

(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072,
   btl_rndv_eager_limit = 131072, btl_max_send_size = 262144,
   btl_rdma_pipeline_send_length = 2147483647,
   btl_rdma_pipeline_frag_size = 2147483647,
   btl_min_rdma_pipeline_size = 2147614719, btl_exclusivity = 65536,
   btl_latency = 0, btl_bandwidth = 100, btl_flags = 10, btl_seg_size = 16,
   btl_add_procs = 0xb7a6fd9c <mca_btl_self_add_procs>,
   btl_del_procs = 0xb7a6fdf9 <mca_btl_self_del_procs>, btl_register = 0,
   btl_finalize = 0xb7a6fe03 <mca_btl_self_finalize>,
   btl_alloc = 0xb7a6fe0d <mca_btl_self_alloc>,
   btl_free = 0xb7a70074 <mca_btl_self_free>,
   btl_prepare_src = 0xb7a70329 <mca_btl_self_prepare_src>,
   btl_prepare_dst = 0xb7a70702 <mca_btl_self_prepare_dst>,
   btl_send = 0xb7a70831 <mca_btl_self_send>, btl_sendi = 0,
   btl_put = 0xb7a70910 <mca_btl_self_rdma>,
   btl_get = 0xb7a70910 <mca_btl_self_rdma>,
   btl_dump = 0xb7f35b57 <mca_btl_base_dump>, btl_mpool = 0x0,
   btl_register_error = 0, btl_ft_event = 0xb7a70b00 <mca_btl_self_ft_event>}
(gdb) l
134                                                   struct mca_btl_base_module_t* btl )
135 {
136     size_t i = 0;
137     /* find the btl */
138     for( i = 0; i < array->arr_size; i++ ) {
139         if( array->bml_btls[i].btl == btl ) {
140             /* make sure not to go out of bounds */
141             for( ; i < array->arr_size-1; i++ ) {
142                 /* move all btl's back by 1, so the found
143                    btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0

Anyone else seeing problems?
--tjn

  _________________________________________________________________________
   Thomas Naughton                                      naught...@ornl.gov
   Research Associate                                   (865) 576-4184


On Fri, 16 May 2014, Gilles Gouaillardet wrote:

> Folks,
>
> a simple
> mpirun -np 2 -host localhost --mca btl tcp,self mpi_helloworld
>
> crashes after some of yesterday's commits (I would blame r31778 and/or
> r31782, but I am not 100% sure)
>
> /* a list receives a negative value, so the program takes some time
> before crashing; symptoms may vary from one system to another */
>
> I dug into this and found what looks like an old bug/typo in
> mca_bml_r2_del_procs().
> The bug was *not* introduced by yesterday's commits.
> I believe this path was not executed until yesterday, which is why we
> are (only) now hitting the bug.
>
> I fixed this in r31786.
>
> Gilles
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14814.php
>
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/05/14819.php
