Hi Siegmar, Ralph,

Sorry, I forgot to follow up on the previous report.
The patch I suggested is not included in Open MPI 1.8.2.
The backtrace Siegmar reported points to the problem that I fixed
in the patch.

  http://www.open-mpi.org/community/lists/users/2014/08/24968.php

Siegmar:
Could you try my patch again?

Ralph (or any committer):
Open MPI 1.8 needs the custom patch that I posted. See my previous mail.
Could you review it and commit it to the v1.8 branch?

Regards,
Takahiro

> Hi,
> 
> yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc
> (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64
> (linpc0)) with gcc-4.9.0. A small program works on some machines,
> but breaks with a bus error on Solaris 10 Sparc.
> 
> 
> tyr small_prog 118 which mpicc
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
> tyr small_prog 119 ompi_info | grep MPI:
>                 Open MPI: 1.8.2
> tyr small_prog 120 mpiexec -np 1 --host linpc0 init_finalize
> Hello!
> tyr small_prog 121 mpiexec -np 1 --host sunpc0 init_finalize
> Hello!
> tyr small_prog 122 mpiexec -np 1 --host tyr init_finalize
> [tyr:28081] *** Process received signal ***
> [tyr:28081] Signal: Bus Error (10)
> [tyr:28081] Signal code: Invalid address alignment (1)
> [tyr:28081] Failing at address: ffffffff7fffd304
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> [tyr:28081] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 28081 on node tyr exited on 
> signal 10 (Bus Error).
> --------------------------------------------------------------------------
> tyr small_prog 123 
> 
> 
> 
> gdb shows the following backtrace.
> 
> tyr small_prog 123 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec  
> GNU gdb (GDB) 7.6.1
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "sparc-sun-solaris2.10".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from 
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done.
> (gdb) run -np 1 --host tyr init_finalize
> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 --host 
> tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP    2        ]
> [tyr:28099] *** Process received signal ***
> [tyr:28099] Signal: Bus Error (10)
> [tyr:28099] Signal code: Invalid address alignment (1)
> [tyr:28099] Failing at address: ffffffff7fffd244
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> [tyr:28099] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 28099 on node tyr exited on 
> signal 10 (Bus Error).
> --------------------------------------------------------------------------
> [LWP    2         exited]
> [New Thread 2        ]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> satisfy query
> (gdb) bt
> #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> #4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> #5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> #6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> #8  0xffffffff7ec77474 in vm_close () from 
> /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #9  0xffffffff7ec74a54 in lt_dlclose ()
>    from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #10 0xffffffff7ec99b78 in ri_destructor (obj=0x1001eada0)
>     at 
> ../../../../openmpi-1.8.2/opal/mca/base/mca_base_component_repository.c:391
> #11 0xffffffff7ec98490 in opal_obj_run_destructors (object=0x1001eada0)
>     at ../../../../openmpi-1.8.2/opal/class/opal_object.h:446
> #12 0xffffffff7ec993f4 in mca_base_component_repository_release (
>     component=0xffffffff7b023ef0 <mca_oob_tcp_component>)
>     at 
> ../../../../openmpi-1.8.2/opal/mca/base/mca_base_component_repository.c:244
> #13 0xffffffff7ec9b73c in mca_base_component_unload (
>     component=0xffffffff7b023ef0 <mca_oob_tcp_component>, output_id=-1)
>     at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:47
> #14 0xffffffff7ec9b7d0 in mca_base_component_close (
>     component=0xffffffff7b023ef0 <mca_oob_tcp_component>, output_id=-1)
>     at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:60
> #15 0xffffffff7ec9b8a4 in mca_base_components_close (output_id=-1, 
>     components=0xffffffff7f12b030 <orte_oob_base_framework+80>, skip=0x0)
>     at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:86
> #16 0xffffffff7ec9b80c in mca_base_framework_components_close (
>     framework=0xffffffff7f12afe0 <orte_oob_base_framework>, skip=0x0)
>     at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:66
> #17 0xffffffff7efae0e8 in orte_oob_base_close ()
>     at ../../../../openmpi-1.8.2/orte/mca/oob/base/oob_base_frame.c:94
> #18 0xffffffff7ecb28b4 in mca_base_framework_close (
>     framework=0xffffffff7f12afe0 <orte_oob_base_framework>)
>     at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_framework.c:187
> #19 0xffffffff7bf078c0 in rte_finalize ()
>     at ../../../../../openmpi-1.8.2/orte/mca/ess/hnp/ess_hnp_module.c:858
> #20 0xffffffff7ef30924 in orte_finalize () at 
> ../../openmpi-1.8.2/orte/runtime/orte_finalize.c:65
> #21 0x00000001000070c4 in orterun (argc=6, argv=0xffffffff7fffe0e8)
>     at ../../../../openmpi-1.8.2/orte/tools/orterun/orterun.c:1096
> #22 0x0000000100003d70 in main (argc=6, argv=0xffffffff7fffe0e8)
>     at ../../../../openmpi-1.8.2/orte/tools/orterun/main.c:13
> (gdb) 
> 
> 
> I would be grateful, if somebody can fix the problem. Thank you
> very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
Index: opal/mca/db/hash/db_hash.c
===================================================================
--- opal/mca/db/hash/db_hash.c	(revision 32498)
+++ opal/mca/db/hash/db_hash.c	(working copy)
@@ -249,7 +249,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT64;
-        kv->data.uint64 = *(uint64_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint64, data, 8);
         break;
     case OPAL_UINT32:
         if (NULL == data) {
@@ -257,7 +258,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT32;
-        kv->data.uint32 = *(uint32_t*)data;
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint32, data, 4);
         break;
     case OPAL_UINT16:
         if (NULL == data) {
@@ -265,7 +267,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT16;
-        kv->data.uint16 = *(uint16_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint16, data, 2);
         break;
     case OPAL_INT:
         if (NULL == data) {
@@ -273,7 +276,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_INT;
-        kv->data.integer = *(int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.integer, data, sizeof(int));
         break;
     case OPAL_UINT:
         if (NULL == data) {
@@ -281,7 +285,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT;
-        kv->data.uint = *(unsigned int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint, data, sizeof(unsigned int));
         break;
     case OPAL_FLOAT:
         if (NULL == data) {
