I suspect the problem is here:

+/**
+ * MOSIX BTL component.
+ */
+struct mca_btl_base_component_t {
+    mca_btl_base_component_2_0_0_t super;  /**< base BTL component */
+    mca_btl_mosix_module_t mosix_module;   /**< local module */
+};
+typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
+
+OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t mca_btl_mosix_component;
+


You redefined the mca_btl_base_component_t struct. What we usually do is define 
a new struct:

struct mca_btl_mosix_component_t {
        mca_btl_base_component_t super;  /**< base BTL component */
        mca_btl_mosix_module_t mosix_module;   /**< local module */
};
typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;

You can then extend that component with your additional info, leaving the 
base component to hold only the required minimal elements.
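
To illustrate the pattern, here is a minimal, self-contained sketch. It does 
not use the real Open MPI headers: every demo_* name below is a hypothetical 
stand-in, and the base type is reduced to a single name field. The sketch only 
shows the shape of the idea: a new struct tag for the derived component, with 
the base type embedded as its first member.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the framework's base component type. */
typedef struct demo_base_component {
    const char *name;            /* the name the framework reads */
} demo_base_component_t;

/* Hypothetical stand-in for a BTL module. */
typedef struct demo_module {
    int socket_fd;
} demo_module_t;

/* A NEW struct tag for the derived component; the base type is embedded
 * as the first member instead of being redefined. */
struct demo_mosix_component {
    demo_base_component_t super;   /* base component */
    demo_module_t mosix_module;    /* local module */
};
typedef struct demo_mosix_component demo_mosix_component_t;

/* With super at offset 0, a pointer to the derived struct can be
 * converted to a pointer to the base struct and back. */
_Static_assert(offsetof(demo_mosix_component_t, super) == 0,
               "super must be the first member");

static demo_mosix_component_t demo_mosix_component = {
    .super = { .name = "mosix" },
    .mosix_module = { .socket_fd = -1 },
};

/* Framework-side code: it only knows about the base type. */
static const char *demo_component_name(const demo_base_component_t *component)
{
    assert(component != NULL);
    return component->name;
}

/* Component-side code: recover the derived struct from a base pointer. */
static demo_mosix_component_t *demo_mosix_from_base(demo_base_component_t *component)
{
    assert(strcmp(component->name, "mosix") == 0);
    return (demo_mosix_component_t *) component;
}
```

In this sketch the framework would be handed &demo_mosix_component.super, 
while the component's own code can get back to its extra fields through 
demo_mosix_from_base().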


On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:

> I traced the problem to the BML component:
> Index: ompi/mca/bml/r2/bml_r2.c
> ===================================================================
> --- ompi/mca/bml/r2/bml_r2.c    (revision 26191)
> +++ ompi/mca/bml/r2/bml_r2.c    (working copy)
> @@ -105,6 +105,8 @@
>             }
>         }
>         if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
> +            printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
> +            printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
>             opal_argv_append_nosize(&btl_names_argv,
>                                     btl->btl_component->btl_version.mca_component_name);
>         }
> 
> I get (whitespace removed) for a normal run:
> R1: 0x7f820e3c31d8
> R2: self
> R1: 0x7f820e13c598
> R2: tcp
> ... and for my module:
> R1: 0x38
> - and then the segmentation fault.
> I guess it has something to do with the way I initialize my component - I'll 
> resume debugging after lunch.
> 
> Alex
> 
> On 03/31/2012 07:04 PM, Alex Margolin wrote:
>> 
>> P.S. I get the following Error - I'm pretty sure my BTL is to blame here:
>> 
>> alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl_base_verbose 100 
>> -mca btl self,mosix hello
>> [singularity:10838] mca: base: component_find: unable to open 
>> /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
>> shared object file: No such file or directory (ignored)
>> [singularity:10838] mca: base: components_open: Looking for btl components
>> [singularity:10838] mca: base: components_open: opening btl components
>> [singularity:10838] mca: base: components_open: found loaded component mosix
>> [singularity:10838] mca: base: components_open: component mosix register 
>> function successful
>> [singularity:10838] mca: base: components_open: component mosix open 
>> function successful
>> [singularity:10838] mca: base: components_open: found loaded component self
>> [singularity:10838] mca: base: components_open: component self has no 
>> register function
>> [singularity:10838] mca: base: components_open: component self open function 
>> successful
>> [singularity:10838] mca: base: component_find: unable to open 
>> /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open 
>> shared object file: No such file or directory (ignored)
>> [singularity:10838] select: initializing btl component mosix
>> [singularity:10838] select: init of component mosix returned success
>> [singularity:10838] select: initializing btl component self
>> [singularity:10838] select: init of component self returned success
>> [singularity:10838] *** Process received signal ***
>> [singularity:10838] Signal: Segmentation fault (11)
>> [singularity:10838] Signal code: Address not mapped (1)
>> [singularity:10838] Failing at address: 0x30
>> [singularity:10838] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36420) 
>> [0x7fa94a3cd420]
>> [singularity:10838] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x84391) 
>> [0x7fa94a41b391]
>> [singularity:10838] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__strdup+0x16) 
>> [0x7fa94a41b086]
>> [singularity:10838] [ 3] 
>> /usr/local/lib/libmpi.so.0(opal_argv_append_nosize+0xf7) [0x7fa94add66a4]
>> [singularity:10838] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1cf5) 
>> [0x7fa946177cf5]
>> [singularity:10838] [ 5] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1e50) 
>> [0x7fa946177e50]
>> [singularity:10838] [ 6] 
>> /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x12f) 
>> [0x7fa946382b6d]
>> [singularity:10838] [ 7] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x909) 
>> [0x7fa94acd1549]
>> [singularity:10838] [ 8] /usr/local/lib/libmpi.so.0(MPI_Init+0x16c) 
>> [0x7fa94ad033ec]
>> [singularity:10838] [ 9] 
>> /home/alex/huji/benchmarks/simple/hello(_ZN3MPI4InitERiRPPc+0x23) [0x409e2d]
>> [singularity:10838] [10] /home/alex/huji/benchmarks/simple/hello(main+0x22) 
>> [0x408f66]
>> [singularity:10838] [11] 
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fa94a3b830d]
>> [singularity:10838] [12] /home/alex/huji/benchmarks/simple/hello() [0x408e89]
>> [singularity:10838] *** End of error message ***
>> -------------------------------------------------------------------------- 
>> mpirun noticed that process rank 0 with PID 10838 on node singularity exited 
>> on signal 11 (Segmentation fault).
>> -------------------------------------------------------------------------- 
>> alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,tcp hello
>> [singularity:10841] mca: base: component_find: unable to open 
>> /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
>> shared object file: No such file or directory (ignored)
>> [singularity:10841] mca: base: component_find: unable to open 
>> /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open 
>> shared object file: No such file or directory (ignored)
>> Hello world!
>> alex@singularity:~/huji/benchmarks/simple$
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

