I am not seeing this. Maybe it is something exposed by the fact we actually call del_procs correctly now. I will try to take a look over the weekend.
-Nathan

________________________________________
From: devel [devel-boun...@open-mpi.org] on behalf of Thomas Naughton [naught...@ornl.gov]
Sent: Friday, May 16, 2014 11:43 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self

Hi,

I'm also seeing some sporadic failures with recent commits to trunk. My tests use a slightly different build/configuration and a different RTE, but the errors are coming from the OMPI ob1 layer.

  works: r31777   (I did not test r31778..r31783)
  fails: r31784M  (plus manually applied patch from r31786)

My test was something simple:

    cd examples/
    mpicc -g hello_c.c -o hello_c
    mpirun -np 10 hello_c

Again, it is sporadic; I was able to reproduce the failure with different values of -np > 1 -- sometimes np=3, other times np=11.

Here's some backtrace / debug info...

Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec, btl=0xb7a721c0)
    at ../../../../ompi/mca/bml/bml.h:139
139             if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec, btl=0xb7a721c0)
    at ../../../../ompi/mca/bml/bml.h:139
#1  0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0) at bml_r2.c:551
#2  0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3  0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4  0xb7f5a755 in mca_pml_v_component_parasite_finalize () at pml_v_component.c:161
#5  0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6  0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7  0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8  0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24
(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072, btl_rndv_eager_limit = 131072,
  btl_max_send_size = 262144, btl_rdma_pipeline_send_length = 2147483647,
  btl_rdma_pipeline_frag_size = 2147483647, btl_min_rdma_pipeline_size = 2147614719,
  btl_exclusivity = 65536, btl_latency = 0, btl_bandwidth = 100, btl_flags = 10,
  btl_seg_size = 16, btl_add_procs = 0xb7a6fd9c <mca_btl_self_add_procs>,
  btl_del_procs = 0xb7a6fdf9 <mca_btl_self_del_procs>, btl_register = 0,
  btl_finalize = 0xb7a6fe03 <mca_btl_self_finalize>, btl_alloc = 0xb7a6fe0d <mca_btl_self_alloc>,
  btl_free = 0xb7a70074 <mca_btl_self_free>, btl_prepare_src = 0xb7a70329 <mca_btl_self_prepare_src>,
  btl_prepare_dst = 0xb7a70702 <mca_btl_self_prepare_dst>, btl_send = 0xb7a70831 <mca_btl_self_send>,
  btl_sendi = 0, btl_put = 0xb7a70910 <mca_btl_self_rdma>, btl_get = 0xb7a70910 <mca_btl_self_rdma>,
  btl_dump = 0xb7f35b57 <mca_btl_base_dump>, btl_mpool = 0x0, btl_register_error = 0,
  btl_ft_event = 0xb7a70b00 <mca_btl_self_ft_event>}
(gdb) l
134         struct mca_btl_base_module_t* btl )
135     {
136         size_t i = 0;
137         /* find the btl */
138         for( i = 0; i < array->arr_size; i++ ) {
139             if( array->bml_btls[i].btl == btl ) {
140                 /* make sure not to go out of bounds */
141                 for( ; i < array->arr_size-1; i++ ) {
142                     /* move all btl's back by 1, so the found
143                        btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0

Anyone else seeing problems?
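For what it's worth, prints $4 and $5 above show the inconsistent state directly: arr_size is still 69 but bml_btls is NULL, so the compare at bml.h:139 dereferences a NULL array. Below is a minimal sketch of the removal function with a defensive guard added; the signature and field names are taken from the gdb output above, but the guard itself is only a hypothetical band-aid (return type assumed void here), not the actual r31786 fix:

    static inline void mca_bml_base_btl_array_remove( mca_bml_base_btl_array_t* array,
                                                      struct mca_btl_base_module_t* btl )
    {
        size_t i = 0;

        /* a non-zero count with no backing storage means the array was
           already torn down (or never allocated); bail out instead of
           dereferencing NULL as at bml.h:139 */
        if( NULL == array->bml_btls ) {
            return;
        }

        /* find the btl */
        for( i = 0; i < array->arr_size; i++ ) {
            if( array->bml_btls[i].btl == btl ) {
                /* make sure not to go out of bounds: move all later
                   entries back by 1, so the found btl is "removed" */
                for( ; i < array->arr_size-1; i++ ) {
                    array->bml_btls[i] = array->bml_btls[i+1];
                }
                array->arr_size--;
                return;
            }
        }
    }

A guard like this would only mask the real problem -- whatever freed bml_btls without resetting arr_size, or a del_procs path that runs twice -- but it does localize the fault to teardown ordering under mca_bml_r2_finalize().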
--tjn
_________________________________________________________________________
  Thomas Naughton                                       naught...@ornl.gov
  Research Associate                                      (865) 576-4184

On Fri, 16 May 2014, Gilles Gouaillardet wrote:

> Folks,
>
> a simple
>
>   mpirun -np 2 -host localhost --mca btl tcp,self mpi_helloworld
>
> crashes after some of yesterday's commits (i would blame r31778 and/or
> r31782, but i am not 100% sure)
>
> /* a list receives a negative value, so the program takes some time
> before crashing; symptoms may vary from one system to another */
>
> i dug into this and found what looks like an old bug/typo in
> mca_bml_r2_del_procs(). the bug was *not* introduced by yesterday's
> commits; i believe this path was simply never executed before yesterday,
> which is why we only now hit it.
>
> i fixed this in r31786
>
> Gilles
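To double-check the removal logic in isolation, the loop can also be lifted into a tiny standalone harness. The fake_* types below are simplified stand-ins, not the real OMPI structs, and the test only exercises the shift-down logic plus the guard sketched earlier; it says nothing about why bml_btls ended up NULL:

    /* smoke test for the shift-down removal; fake_* types are
       simplified stand-ins for the OMPI bml/btl structures */
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct { int id; } fake_btl_t;              /* stands in for mca_btl_base_module_t */
    typedef struct { fake_btl_t* btl; } fake_bml_btl_t; /* stands in for mca_bml_base_btl_t */
    typedef struct {
        size_t          arr_size;
        fake_bml_btl_t* bml_btls;
    } fake_array_t;                                     /* stands in for mca_bml_base_btl_array_t */

    static void array_remove( fake_array_t* array, fake_btl_t* btl )
    {
        size_t i;
        if( NULL == array->bml_btls ) return;  /* the guard discussed above */
        for( i = 0; i < array->arr_size; i++ ) {
            if( array->bml_btls[i].btl == btl ) {
                for( ; i < array->arr_size-1; i++ ) {
                    array->bml_btls[i] = array->bml_btls[i+1];
                }
                array->arr_size--;
                return;
            }
        }
    }

    int main( void )
    {
        fake_btl_t a = { 1 }, b = { 2 }, c = { 3 };
        fake_bml_btl_t slots[3] = { { &a }, { &b }, { &c } };
        fake_array_t arr = { 3, slots };

        array_remove( &arr, &b );            /* normal removal: b shifted out */
        assert( 2 == arr.arr_size );
        assert( &a == slots[0].btl && &c == slots[1].btl );

        fake_array_t broken = { 69, NULL };  /* the state seen in the core file */
        array_remove( &broken, &a );         /* would segfault without the guard */

        printf( "removal logic ok\n" );
        return 0;
    }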