I traced the problem to the BML component:
Index: ompi/mca/bml/r2/bml_r2.c
===================================================================
--- ompi/mca/bml/r2/bml_r2.c (revision 26191)
+++ ompi/mca/bml/r2/bml_r2.c (working copy)
@@ -105,6 +105,8 @@
}
}
if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
+ printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
+ printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
opal_argv_append_nosize(&btl_names_argv,
btl->btl_component->btl_version.mca_component_name);
}
I get the following (whitespace removed) for a normal run:
R1: 0x7f820e3c31d8
R2: self
R1: 0x7f820e13c598
R2: tcp
... and for my module:
R1: 0x38
- and then the segmentation fault.
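A note on the R1 value: 0x38 is far too small to be a real mapped address, and looks more like a member offset added to a NULL base pointer. Here is a minimal sketch of that arithmetic, assuming 4-byte ints and a hypothetical header layout (three ints, a 32-byte type name, three more ints, then the name array). The struct, its fields and the helper name are made up for illustration and are not copied from the Open MPI headers.

```c
#include <stddef.h>

/* Hypothetical component header; field names and sizes are assumptions. */
struct demo_component {
    int  ver_major, ver_minor, ver_release;
    char type_name[32];
    int  type_major, type_minor, type_release;
    char component_name[64];
};

/* Offset of the name array inside the header. Taking the address of
 * component_name through a NULL component pointer would yield exactly
 * this value, since the member address is base + offset. */
static size_t demo_name_offset(void)
{
    return offsetof(struct demo_component, component_name);
}
```

Under these assumptions demo_name_offset() returns 56, i.e. 0x38, which matches the R1 value printed for my module. That would be consistent with btl->btl_component being NULL there, though I have not confirmed it yet.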
I guess it has something to do with the way I initialize my component -
I'll resume debugging after lunch.
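One thing I plan to check is whether my module ever gets its component back-pointer set. Below is a rough sketch of the idea, using hypothetical stand-in structs and names (demo_component, demo_module, demo_module_name) rather than the real BTL types: one module sets the pointer in its static initializer, the other is left zero-initialized, and a small helper refuses to dereference a missing pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the component and module structs. */
struct demo_component { char name[64]; };
struct demo_module    { struct demo_component *component; };

static struct demo_component demo_mosix_component = { "mosix" };

/* Module whose back-pointer is set in the static initializer. */
static struct demo_module demo_good_module = { &demo_mosix_component };

/* Module left zero-initialized: its component pointer stays NULL. */
static struct demo_module demo_bad_module;

/* Return the component name, or a placeholder when the module or its
 * back-pointer is missing, instead of dereferencing NULL. */
static const char *demo_module_name(const struct demo_module *m)
{
    if (NULL == m || NULL == m->component) {
        return "(no component)";
    }
    return m->component->name;
}
```

A check of this kind in front of the debug printf calls would print a placeholder for the broken module instead of crashing, which should make it easier to see which module is missing its pointer.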
Alex
On 03/31/2012 07:04 PM, Alex Margolin wrote:
P.S. I get the following error. I'm pretty sure my BTL is to blame here:
alex@singularity:~/huji/benchmarks/simple$ mpirun -mca
btl_base_verbose 100 -mca btl self,mosix hello
[singularity:10838] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot
open shared object file: No such file or directory (ignored)
[singularity:10838] mca: base: components_open: Looking for btl
components
[singularity:10838] mca: base: components_open: opening btl components
[singularity:10838] mca: base: components_open: found loaded component
mosix
[singularity:10838] mca: base: components_open: component mosix
register function successful
[singularity:10838] mca: base: components_open: component mosix open
function successful
[singularity:10838] mca: base: components_open: found loaded component
self
[singularity:10838] mca: base: components_open: component self has no
register function
[singularity:10838] mca: base: components_open: component self open
function successful
[singularity:10838] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open
shared object file: No such file or directory (ignored)
[singularity:10838] select: initializing btl component mosix
[singularity:10838] select: init of component mosix returned success
[singularity:10838] select: initializing btl component self
[singularity:10838] select: init of component self returned success
[singularity:10838] *** Process received signal ***
[singularity:10838] Signal: Segmentation fault (11)
[singularity:10838] Signal code: Address not mapped (1)
[singularity:10838] Failing at address: 0x30
[singularity:10838] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36420)
[0x7fa94a3cd420]
[singularity:10838] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x84391)
[0x7fa94a41b391]
[singularity:10838] [ 2]
/lib/x86_64-linux-gnu/libc.so.6(__strdup+0x16) [0x7fa94a41b086]
[singularity:10838] [ 3]
/usr/local/lib/libmpi.so.0(opal_argv_append_nosize+0xf7) [0x7fa94add66a4]
[singularity:10838] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1cf5)
[0x7fa946177cf5]
[singularity:10838] [ 5] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1e50)
[0x7fa946177e50]
[singularity:10838] [ 6]
/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x12f)
[0x7fa946382b6d]
[singularity:10838] [ 7]
/usr/local/lib/libmpi.so.0(ompi_mpi_init+0x909) [0x7fa94acd1549]
[singularity:10838] [ 8] /usr/local/lib/libmpi.so.0(MPI_Init+0x16c)
[0x7fa94ad033ec]
[singularity:10838] [ 9]
/home/alex/huji/benchmarks/simple/hello(_ZN3MPI4InitERiRPPc+0x23)
[0x409e2d]
[singularity:10838] [10]
/home/alex/huji/benchmarks/simple/hello(main+0x22) [0x408f66]
[singularity:10838] [11]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fa94a3b830d]
[singularity:10838] [12] /home/alex/huji/benchmarks/simple/hello()
[0x408e89]
[singularity:10838] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10838 on node singularity
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,tcp hello
[singularity:10841] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot
open shared object file: No such file or directory (ignored)
[singularity:10841] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open
shared object file: No such file or directory (ignored)
Hello world!
alex@singularity:~/huji/benchmarks/simple$