I've been seeing an intermittent segv (once every 4 hours when looping
on a quick initialization program) with the following stack trace:
=>[1] mca_btl_sm_add_procs(btl = 0xfffffd7ffdb67ef0, nprocs = 2U,
        procs = 0x591560, peers = 0x591580,
        reachability = 0xfffffd7fffdff000), line 519 in "btl_sm.c"
  [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560,
        bml_endpoints = 0x591500, reachable = 0xfffffd7fffdff000),
        line 222 in "bml_r2.c"
  [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U),
        line 248 in "pml_ob1.c"
  [4] ompi_mpi_init(argc = 1, argv = 0xfffffd7fffdff318, requested = 0,
        provided = 0xfffffd7fffdff234), line 651 in "ompi_mpi_init.c"
  [5] PMPI_Init(argc = 0xfffffd7fffdff2ec, argv = 0xfffffd7fffdff2e0),
        line 90 in "pinit.c"
  [6] main(argc = 1, argv = 0xfffffd7fffdff318), line 82 in "buffer.c"
I believe the problem is that mca_btl_sm_component.shm_fifo[j] contains
uninitialized data, which causes the loop on line 504 in btl_sm.c to
think that a remote rank has already set its fifo address.
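
To make the suspected failure mode concrete, here is a minimal
single-process sketch. It is not the code from btl_sm.c: the type
shared_fifo_table, the FIFO_UNSET sentinel, and all function names are
hypothetical stand-ins, and the real code would also need proper
cross-process synchronization. It only shows that a "has the peer
published its address yet?" test passes on stale bytes unless every
slot is set to a known sentinel before anyone polls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPROCS 2
/* Hypothetical sentinel: "this rank has not published its fifo yet". */
#define FIFO_UNSET ((uintptr_t)0)

/* Hypothetical stand-in for a shared array of per-rank fifo addresses. */
typedef struct {
    uintptr_t fifo_addr[NPROCS];
} shared_fifo_table;

/* Mark every slot as unpublished; must run before any rank polls. */
static void fifo_table_init(shared_fifo_table *t)
{
    for (size_t j = 0; j < NPROCS; j++) {
        t->fifo_addr[j] = FIFO_UNSET;
    }
}

/* Simulate stale bytes left in a segment that was never initialized. */
static void fifo_table_fill_garbage(shared_fifo_table *t)
{
    memset(t, 0xA5, sizeof *t);
}

/* The check a polling loop effectively performs: any value other than
 * the sentinel is taken to mean "rank j has set its fifo address". */
static int fifo_is_published(const shared_fifo_table *t, size_t j)
{
    assert(j < NPROCS);
    return t->fifo_addr[j] != FIFO_UNSET;
}
```

With garbage in the table, fifo_is_published() returns true for a rank
that never wrote its slot, so a caller would go on to dereference a
bogus address. After fifo_table_init() the same check returns false
until the slot is actually written.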
Has anyone else seen the above happening?
--td