Hi,
I'm also seeing sporadic failures with recent commits to trunk.
My tests use a slightly different build/configuration and a
different RTE, but the errors are coming from the OMPI ob1 layer.
works: r31777 (I did not test r31778..r31783)
fails: r31784M (plus manually applied patch from r31786)
My test was something simple:
cd examples/
mpicc -g hello_c.c -o hello_c
mpirun -np 10 hello_c
Again, it is sporadic. I was able to reproduce the failure with different
values of '-np' > 1: sometimes np=3, other times np=11.
Here's some backtrace / debug info...
Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0 0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
139 if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0 0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
#1 0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0)
at bml_r2.c:551
#2 0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3 0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4 0xb7f5a755 in mca_pml_v_component_parasite_finalize ()
at pml_v_component.c:161
#5 0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6 0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7 0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8 0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24
(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072,
btl_rndv_eager_limit = 131072, btl_max_send_size = 262144,
btl_rdma_pipeline_send_length = 2147483647,
btl_rdma_pipeline_frag_size = 2147483647,
btl_min_rdma_pipeline_size = 2147614719, btl_exclusivity = 65536,
btl_latency = 0, btl_bandwidth = 100, btl_flags = 10, btl_seg_size = 16,
btl_add_procs = 0xb7a6fd9c <mca_btl_self_add_procs>,
btl_del_procs = 0xb7a6fdf9 <mca_btl_self_del_procs>, btl_register = 0,
btl_finalize = 0xb7a6fe03 <mca_btl_self_finalize>,
btl_alloc = 0xb7a6fe0d <mca_btl_self_alloc>,
btl_free = 0xb7a70074 <mca_btl_self_free>,
btl_prepare_src = 0xb7a70329 <mca_btl_self_prepare_src>,
btl_prepare_dst = 0xb7a70702 <mca_btl_self_prepare_dst>,
btl_send = 0xb7a70831 <mca_btl_self_send>, btl_sendi = 0,
btl_put = 0xb7a70910 <mca_btl_self_rdma>,
btl_get = 0xb7a70910 <mca_btl_self_rdma>,
btl_dump = 0xb7f35b57 <mca_btl_base_dump>, btl_mpool = 0x0,
btl_register_error = 0, btl_ft_event = 0xb7a70b00 <mca_btl_self_ft_event>}
(gdb) l
134 struct mca_btl_base_module_t* btl )
135 {
136 size_t i = 0;
137 /* find the btl */
138 for( i = 0; i < array->arr_size; i++ ) {
139 if( array->bml_btls[i].btl == btl ) {
140 /* make sure not to go out of bounds */
141 for( ; i < array->arr_size-1; i++ ) {
142 /* move all btl's back by 1, so the found
143 btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0
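To spell out what the gdb output above suggests: arr_size is 69 while
bml_btls is NULL, so the comparison at line 139 dereferences a null
pointer on the first loop iteration. Below is a minimal sketch of a
removal routine that refuses to walk a NULL array. The types and the
function name (btl_module, bml_btl, btl_array, btl_array_remove) are
hypothetical, simplified stand-ins, not the actual OMPI definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real OMPI types; only the fields
 * that matter for the crash are modeled here. */
struct btl_module { int id; };

struct bml_btl { struct btl_module *btl; };

struct btl_array {
    size_t arr_size;          /* number of entries the array claims to hold */
    struct bml_btl *bml_btls; /* NULL if never allocated or already released */
};

/* Remove `btl` from `array`. Returns true if an entry was removed.
 * Unlike the loop in the listing, this checks for a NULL backing
 * array before indexing into it. */
static bool btl_array_remove(struct btl_array *array, struct btl_module *btl)
{
    if (NULL == array || NULL == array->bml_btls) {
        return false; /* nothing to search, even if arr_size is nonzero */
    }
    for (size_t i = 0; i < array->arr_size; i++) {
        if (array->bml_btls[i].btl == btl) {
            /* shift the remaining entries down by one slot */
            for (; i + 1 < array->arr_size; i++) {
                array->bml_btls[i] = array->bml_btls[i + 1];
            }
            array->arr_size--;
            return true;
        }
    }
    return false;
}
```

A guard like this would only hide the symptom. The open question is
still why arr_size and bml_btls are inconsistent by the time
mca_bml_r2_finalize() runs.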
Anyone else seeing problems?
--tjn
_________________________________________________________________________
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184
On Fri, 16 May 2014, Gilles Gouaillardet wrote:
Folks,
a simple
mpirun -np 2 -host localhost --mca btl,tcp mpi_helloworld
crashes after some of yesterday's commits (i would blame r31778 and/or
r31782, but i am not 100% sure)
/* a list receives a negative value, so the program takes some time
before crashing;
the symptom may vary from one system to another */
i dug into this and found what looks like an old bug/typo in
mca_bml_r2_del_procs().
the bug was *not* introduced by yesterday's commits.
i believe this path was not executed until yesterday, which is why we
(only) now hit the bug
i fixed this in r31786
Gilles
Link to this post:
http://www.open-mpi.org/community/lists/devel/2014/05/14814.php