I've been seeing an intermittent segv (roughly once every 4 hours when looping on a quick initialization program) with the following stack trace.

=>[1] mca_btl_sm_add_procs(btl = 0xfffffd7ffdb67ef0, nprocs = 2U, procs = 0x591560, peers = 0x591580, reachability = 0xfffffd7fffdff000), line 519 in "btl_sm.c"
  [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560, bml_endpoints = 0x591500, reachable = 0xfffffd7fffdff000), line 222 in "bml_r2.c"
  [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U), line 248 in "pml_ob1.c"
  [4] ompi_mpi_init(argc = 1, argv = 0xfffffd7fffdff318, requested = 0, provided = 0xfffffd7fffdff234), line 651 in "ompi_mpi_init.c"
  [5] PMPI_Init(argc = 0xfffffd7fffdff2ec, argv = 0xfffffd7fffdff2e0), line 90 in "pinit.c"
  [6] main(argc = 1, argv = 0xfffffd7fffdff318), line 82 in "buffer.c"

I believe the problem is that mca_btl_sm_component.shm_fifo[j] contains uninitialized data, which causes the loop on line 504 in btl_sm.c to think that a remote rank has already set its fifo address.
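
To illustrate the failure mode I suspect, here is a minimal sketch. It is not the btl_sm.c code: the names shm_table_t, fill_with_stale_bytes, init_table and peer_has_published are hypothetical, and I am assuming that an all-zero slot is the "not yet published" sentinel. The point is only that a waiter testing a slot against the sentinel is fooled by stale bytes unless every slot is explicitly initialized first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPROCS 2 /* matches nprocs = 2 in the trace; otherwise arbitrary */

/* Hypothetical stand-in for a per-rank fifo address table kept in
 * shared memory. Addresses are stored as integers; 0 is assumed to
 * mean "this rank has not published its fifo yet". */
typedef struct {
    uintptr_t fifo[NPROCS];
} shm_table_t;

/* Emulate a freshly mapped segment that still holds stale bytes. */
void fill_with_stale_bytes(shm_table_t *t)
{
    memset(t, 0xA5, sizeof(*t));
}

/* The initialization step: mark every slot as "not yet published". */
void init_table(shm_table_t *t)
{
    for (size_t j = 0; j < NPROCS; j++) {
        t->fifo[j] = 0;
    }
}

/* What a waiting peer checks before using rank j's fifo address. */
int peer_has_published(const shm_table_t *t, size_t j)
{
    assert(j < NPROCS);
    return t->fifo[j] != 0;
}
```

With stale bytes in the table, peer_has_published() reports a slot as set even though no rank has written it, so the waiter would go on to use a garbage address. After init_table() it reports the slot as unset until a rank actually stores its address.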

Has anyone else seen the above happening?

--td
