Hi,

I'm also seeing some sporadic failures with recent commits to trunk.
My tests use a slightly different build/configuration and a different rte,
but the errors are coming from the OMPI ob1 layer.

 works: r31777   (I did not test r31778..r31783)
 fails: r31784M  (plus manually applied patch from r31786)

My test was something simple:
    cd examples/
    mpicc -g hello_c.c -o hello_c
    mpirun -np 10 hello_c

Again, it is sporadic; I was able to reproduce the failure with different
values of '-np' > 1, sometimes np=3, other times np=11.

Here's some backtrace / debug info...

Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
    btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
139         if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
    btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
#1  0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0)
    at bml_r2.c:551
#2  0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3  0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4  0xb7f5a755 in mca_pml_v_component_parasite_finalize ()
    at pml_v_component.c:161
#5  0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6  0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7  0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8  0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24

(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072,
  btl_rndv_eager_limit = 131072, btl_max_send_size = 262144,
  btl_rdma_pipeline_send_length = 2147483647,
  btl_rdma_pipeline_frag_size = 2147483647,
  btl_min_rdma_pipeline_size = 2147614719, btl_exclusivity = 65536,
  btl_latency = 0, btl_bandwidth = 100, btl_flags = 10, btl_seg_size = 16,
  btl_add_procs = 0xb7a6fd9c <mca_btl_self_add_procs>,
  btl_del_procs = 0xb7a6fdf9 <mca_btl_self_del_procs>, btl_register = 0,
  btl_finalize = 0xb7a6fe03 <mca_btl_self_finalize>,
  btl_alloc = 0xb7a6fe0d <mca_btl_self_alloc>,
  btl_free = 0xb7a70074 <mca_btl_self_free>,
  btl_prepare_src = 0xb7a70329 <mca_btl_self_prepare_src>,
  btl_prepare_dst = 0xb7a70702 <mca_btl_self_prepare_dst>,
  btl_send = 0xb7a70831 <mca_btl_self_send>, btl_sendi = 0,
  btl_put = 0xb7a70910 <mca_btl_self_rdma>,
  btl_get = 0xb7a70910 <mca_btl_self_rdma>,
  btl_dump = 0xb7f35b57 <mca_btl_base_dump>, btl_mpool = 0x0,
  btl_register_error = 0, btl_ft_event = 0xb7a70b00
<mca_btl_self_ft_event>}
(gdb) l
134                                                   struct mca_btl_base_module_t* btl )
135 {
136     size_t i = 0;
137     /* find the btl */
138     for( i = 0; i < array->arr_size; i++ ) {
139         if( array->bml_btls[i].btl == btl ) {
140             /* make sure not to go out of bounds */
141             for( ; i < array->arr_size-1; i++ ) {
142                 /* move all btl's back by 1, so the found
143                    btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0
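
The segfault is consistent with those last two prints: arr_size claims 69
entries, but bml_btls is NULL, so the very first iteration of the loop at
bml.h:139 dereferences a null pointer. A minimal self-contained sketch of
that inconsistency (my own trimmed-down types and names, not the actual
OMPI structures):

    #include <stddef.h>
    #include <stdint.h>

    /* stand-ins for the OMPI types, just to show the failure mode */
    typedef struct { void *btl; } btl_entry_t;
    typedef struct {
        size_t       arr_size;   /* says 69 in the core file ...        */
        btl_entry_t *bml_btls;   /* ... but the storage pointer is NULL */
    } btl_array_t;

    /* same shape as the loop at bml.h:139 */
    static void array_remove(btl_array_t *array, void *btl)
    {
        for (size_t i = 0; i < array->arr_size; i++) {
            if (array->bml_btls[i].btl == btl) {  /* NULL deref at i == 0 */
                /* ... shift the tail left by one ... */
                return;
            }
        }
    }

    int main(void)
    {
        btl_array_t a = { .arr_size = 69, .bml_btls = NULL };
        array_remove(&a, (void *)(uintptr_t) 0xb7a721c0);  /* segfaults */
        return 0;
    }

So whatever left the array in that state (size set, storage gone) must have
happened before the finalize path ran.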

Anyone else seeing problems?
--tjn

 _________________________________________________________________________
  Thomas Naughton                                      naught...@ornl.gov
  Research Associate                                   (865) 576-4184


On Fri, 16 May 2014, Gilles Gouaillardet wrote:

Folks,

a simple
mpirun -np 2 -host localhost --mca btl tcp mpi_helloworld

crashes after some of yesterday's commits (I would blame r31778 and/or
r31782, but I am not 100% sure).

/* a list receives a negative value, so the program takes some time
before crashing; the symptom may vary from one system to another */
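
To make that concrete: if the list length is an unsigned counter, a
"negative" value really means it wrapped around to a huge number, so
iteration runs far past valid memory and the crash shows up late and in
different places. A tiny illustration (my own example, not OMPI code):

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t len = 0;
        len--;                   /* "negative" length: wraps to SIZE_MAX */
        printf("%zu\n", len);    /* e.g. 4294967295 with a 32-bit size_t */
        return 0;
    }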

I dug into this and found what looks like an old bug/typo in
mca_bml_r2_del_procs().
The bug was *not* introduced by yesterday's commits; I believe this code
path was simply not executed before yesterday, which is why we are only
now hitting it.
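
To illustrate the kind of typo I mean (a hypothetical sketch with made-up
names, *not* the actual diff): a teardown path that frees the array storage
but forgets to reset the size leaves the size describing freed memory, and
any later removal walks a NULL array:

    #include <stdlib.h>

    typedef struct { void *btl; } btl_entry_t;
    typedef struct {
        size_t       arr_size;
        btl_entry_t *bml_btls;
    } btl_array_t;

    /* hypothetical teardown, made-up names; *not* the r31786 change */
    static void array_destroy(btl_array_t *array)
    {
        free(array->bml_btls);
        array->bml_btls = NULL;
        /* forgetting the next line leaves arr_size describing freed
           storage, and a later remove over the array segfaults */
        array->arr_size = 0;
    }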

I fixed this in r31786.

Gilles