George,

(one of the) faulty lines was:

    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
                                            OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
                                            (opal_identifier_t*)&proc, OPAL_ID_T))) {

so if proc is not 64-bit aligned, a SIGBUS will occur on SPARC.
As you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
the issue (I have no such architecture to test on...).
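
For reference, the fixed call would presumably look something like this
(just a sketch, since I cannot test it):

    /* sketch of the suggested fix: store the name with its own DSS type
       instead of OPAL_ID_T, so no 64-bit alignment is assumed */
    if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
                                            OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
                                            &proc, ORTE_NAME))) {
        ORTE_ERROR_LOG(rc);
    }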

I was initially also "confused" by the following line:

        if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
                                                 OPAL_SCOPE_INTERNAL,
                                                 ORTE_DB_NPROC_OFFSET,
                                                 &offset, OPAL_UINT32))) {

The first argument of store is an (opal_identifier_t *).
Strictly speaking this is "a pointer to a 64-bit aligned address", and
proc might not be 64-bit aligned.
/* that being said, there is no crash :-) */

In this case the opal_db.store pointer points to the store function in
db_hash.c (line 178), and proc is only used in a memcpy at line 194, so
64-bit alignment is not actually required
(and the comment there is explicit: /* to protect alignment, copy the data
across */).
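
To illustrate (a paraphrased sketch, not the literal source): the store
function protects itself by copying the identifier into a properly aligned
local before using it,

    uint64_t id;
    /* to protect alignment, copy the data across */
    memcpy(&id, uid, sizeof(opal_identifier_t));

whereas the OPAL_UINT64 branch that actually crashes (db_hash.c:252, see
Kawashima-san's backtrace below) dereferences the caller's pointer directly:

    kv->data.uint64 = *(uint64_t*)(data);  /* SIGBUS on SPARC if data is not 8-byte aligned */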

That might sound pedantic, but are we doing the right thing here?
(i.e. cast to (opal_identifier_t *), followed by a memcpy in case the
pointer was not 64-bit aligned,
vs. always using aligned data?)
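
For the second option, a minimal sketch of the workaround I mentioned in my
earlier mail (quoted below; hypothetical and untested): declare storage with
64-bit alignment and only view it as an orte_process_name_t:

    opal_identifier_t proc_storage;   /* a uint64_t, so 8-byte aligned */
    orte_process_name_t *proc = (orte_process_name_t *)&proc_storage;
    /* use proc->jobid / proc->vpid as before; &proc_storage is now safe to
       pass wherever an (opal_identifier_t *) is expected */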

Cheers,

Gilles

On 2014/08/08 14:58, George Bosilca wrote:
> This is a gigantic patch for an almost trivial issue. The current problem
> is purely related to the fact that in a single location (nidmap.c) the
> orte_process_name_t (which is a structure of 2 integers) is supposed to be
> aligned based on the uint64_t requirements. Bad assumption!
>
> Looking at the code one might notice that the orte_process_name_t is stored
> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
> on the SPARC architecture because the two types (int32_t and int64_t) have
> different alignments.  However, ORTE defines a type for orte_process_name_t.
> Thus, I think that if instead of saving the orte_process_name_t as an
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
>   George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Kawashima-san and all,
>>
>> Here is attached a one off patch for v1.8.
>> /* it does not use the __attribute__ modifier that might not be
>> supported by all compilers */
>>
>> as far as i am concerned, the same issue is also in the trunk,
>> and if you do not hit it, it just means you are lucky :-)
>>
>> the same issue might also be in other parts of the code :-(
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>> Gilles, George,
>>>
>>> The problem is the one Gilles pointed.
>>> I temporarily modified the code bellow and the bus error disappeared.
>>>
>>> --- orte/util/nidmap.c  (revision 32447)
>>> +++ orte/util/nidmap.c  (working copy)
>>> @@ -885,7 +885,7 @@
>>>      orte_proc_state_t state;
>>>      orte_app_idx_t app_idx;
>>>      int32_t restarts;
>>> -    orte_process_name_t proc, dmn;
>>> +    orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>      char *hostname;
>>>      uint8_t flag;
>>>      opal_buffer_t *bptr;
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
>>>> Kawashima-san,
>>>>
>>>> This is interesting :-)
>>>>
>>>> proc is in the stack and has type orte_process_name_t
>>>>
>>>> with
>>>>
>>>> typedef uint32_t orte_jobid_t;
>>>> typedef uint32_t orte_vpid_t;
>>>> struct orte_process_name_t {
>>>>     orte_jobid_t jobid;     /**< Job number */
>>>>     orte_vpid_t vpid;       /**< Process id - equivalent to rank */
>>>> };
>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>
>>>>
>>>> so there is really no reason to align this on 8 bytes...
>>>> but later, proc is casted into an uint64_t ...
>>>> so proc should have been aligned on 8 bytes but it is too late,
>>>> and hence the glory SIGBUS
>>>>
>>>>
>>>> this is loosely related to
>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>>>> (see heterogeneous.v2.patch)
>>>> if we make opal_process_name_t an union of uint64_t and a struct of two
>>>> uint32_t, the compiler
>>>> will align this on 8 bytes.
>>>> note the patch is not enough (and will not apply on the v1.8 branch
>> anyway),
>>>> we could simply remove orte_process_name_t and ompi_process_name_t and
>>>> use only
>>>> opal_process_name_t (and never declare variables with type
>>>> opal_proc_name_t otherwise alignment might be incorrect)
>>>>
>>>> as a workaround, you can declare an opal_process_name_t (for alignment),
>>>> and cast it to an orte_process_name_t
>>>>
>>>> i will write a patch (i will not be able to test on sparc ...)
>>>> please note this issue might be present in other places
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>> Hi,
>>>>>
>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
>>>>> I've finally reproduced the bus error in my SPARC environment.
>>>>>
>>>>> #0 0xffffffff00db4740 (__waitpid_nocancel + 0x44)
>> (0x200,0x0,0x0,0xa0,0xfffff80100064af0,0x35b4)
>>>>> #1 0xffffffff0001a310 (handle_signal + 0x574) (signo=10,info=(struct
>> siginfo *) 0x000007feffffd100,p=(void *) 0x000007feffffd100) at line 277 in
>> ../sigattach.c <SIGNAL HANDLER>
>>>>> #2 0xffffffff0282aff4 (store + 0x540) (uid=(unsigned long *)
>> 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8
>> "opal.local.ldr",data=(void *) 0x000007feffffde74,type=15:'\017') at line
>> 252 in db_hash.c
>>>>> #3 0xffffffff01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long
>> *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8
>> "opal.local.ldr",object=(void *) 0x000007feffffde74,type=15:'\017') at line
>> 49 in db_base_fns.c
>>>>> #4 0xffffffff00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *)
>> 0x0000000000281d70) at line 975 in nidmap.c
>>>>> #5 0xffffffff00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct
>> opal_buffer_t *) 0x0000000000241fc0) at line 141 in nidmap.c
>>>>> #6 0xffffffff01e298cc (rte_init + 0x2a0) () at line 153 in
>> ess_env_module.c
>>>>> #7 0xffffffff00f9f28c (orte_init + 0x308) (pargc=(int *)
>> 0x0000000000000000,pargv=(char ***) 0x0000000000000000,flags=32) at line
>> 148 in orte_init.c
>>>>> #8 0xffffffff001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **)
>> 0x000007fefffff348,requested=0,provided=(int *) 0x000007feffffe698) at line
>> 464 in ompi_mpi_init.c
>>>>> #9 0xffffffff001ff79c (MPI_Init + 0x2b0) (argc=(int *)
>> 0x000007feffffe814,argv=(char ***) 0x000007feffffe818) at line 84 in init.c
>>>>> #10 0x0000000000100ae4 (main + 0x44) (argc=1,argv=(char **)
>> 0x000007fefffff348) at line 8 in mpiinitfinalize.c
>>>>> #11 0xffffffff00d2b81c (__libc_start_main + 0x194)
>> (0x100aa0,0x1,0x7fefffff348,0x100d24,0x100d14,0x0)
>>>>> #12 0x000000000010094c (_start + 0x2c) ()
>>>>>
>>>>> The line 252 in opal/mca/db/hash/db_hash.c is:
>>>>>
>>>>>     case OPAL_UINT64:
>>>>>         if (NULL == data) {
>>>>>             OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>>>>             return OPAL_ERR_BAD_PARAM;
>>>>>         }
>>>>>         kv->type = OPAL_UINT64;
>>>>>         kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
>>>>>         break;
>>>>>
>>>>> My environment is:
>>>>>
>>>>>   Open MPI v1.8 branch r32447 (latest)
>>>>>   configure --enable-debug
>>>>>   SPARC-V9 (Fujitsu SPARC64 IXfx)
>>>>>   Linux (custom)
>>>>>   gcc 4.2.4
>>>>>
>>>>> I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
>>>>>
>>>>> Can this information help?
>>>>>
>>>>> Takahiro Kawashima,
>>>>> MPI development team,
>>>>> Fujitsu
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm sorry once more to answer late, but the last two days our mail
>>>>>> server was down (hardware error).
>>>>>>
>>>>>>> Did you configure this --enable-debug?
>>>>>> Yes, I used the following command.
>>>>>>
>>>>>> ../openmpi-1.8.2rc3/configure
>> --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>>>>>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>>>>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>>>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>>>>>   CC="gcc" CXX="g++" FC="gfortran" \
>>>>>>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>>>   CPP="cpp" CXXCPP="cpp" \
>>>>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>>>>   --enable-mpi-cxx \
>>>>>>   --enable-cxx-exceptions \
>>>>>>   --enable-mpi-java \
>>>>>>   --enable-heterogeneous \
>>>>>>   --enable-mpi-thread-multiple \
>>>>>>   --with-threads=posix \
>>>>>>   --with-hwloc=internal \
>>>>>>   --without-verbs \
>>>>>>   --with-wrapper-cflags="-std=c11 -m64" \
>>>>>>   --enable-debug \
>>>>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
>>>>>>
>>>>>>
>>>>>>
>>>>>>> If so, you should get a line number in the backtrace
>>>>>> I got them for gdb (see below), but not for "dbx".
>>>>>>
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Siegmar
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Aug 5, 2014, at 2:59 AM, Siegmar Gross
>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm sorry to answer so late, but last week I didn't have Internet
>>>>>>>> access. In the meantime I've installed openmpi-1.8.2rc3 and I get
>>>>>>>> the same error.
>>>>>>>>
>>>>>>>>> This looks like the typical type of alignment error that we used
>>>>>>>>> to see when testing regularly on SPARC.  :-\
>>>>>>>>>
>>>>>>>>> It looks like the error was happening in mca_db_hash.so.  Could
>>>>>>>>> you get a stack trace / file+line number where it was failing
>>>>>>>>> in mca_db_hash?  (i.e., the actual bad code will likely be under
>>>>>>>>> opal/mca/db/hash somewhere)
>>>>>>>> Unfortunately I don't get a file+line number from a file in
>>>>>>>> opal/mca/db/Hash.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> tyr small_prog 102 ompi_info | grep MPI:
>>>>>>>>                Open MPI: 1.8.2rc3
>>>>>>>> tyr small_prog 103 which mpicc
>>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
>>>>>>>> tyr small_prog 104 mpicc init_finalize.c
>>>>>>>> tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx
>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
>>>>>>>> For information about new features see `help changes'
>>>>>>>> To remove this message, put `dbxenv suppress_startup_message 7.9'
>> in your
>>>>>> .dbxrc
>>>>>>>> Reading mpiexec
>>>>>>>> Reading ld.so.1
>>>>>>>> Reading libopen-rte.so.7.0.4
>>>>>>>> Reading libopen-pal.so.6.2.0
>>>>>>>> Reading libsendfile.so.1
>>>>>>>> Reading libpicl.so.1
>>>>>>>> Reading libkstat.so.1
>>>>>>>> Reading liblgrp.so.1
>>>>>>>> Reading libsocket.so.1
>>>>>>>> Reading libnsl.so.1
>>>>>>>> Reading libgcc_s.so.1
>>>>>>>> Reading librt.so.1
>>>>>>>> Reading libm.so.2
>>>>>>>> Reading libpthread.so.1
>>>>>>>> Reading libc.so.1
>>>>>>>> Reading libdoor.so.1
>>>>>>>> Reading libaio.so.1
>>>>>>>> Reading libmd.so.1
>>>>>>>> (dbx) check -all
>>>>>>>> access checking - ON
>>>>>>>> memuse checking - ON
>>>>>>>> (dbx) run -np 1 a.outRunning: mpiexec -np 1 a.out
>>>>>>>> (process id 27833)
>>>>>>>> Reading rtcapihook.so
>>>>>>>> Reading libdl.so.1
>>>>>>>> Reading rtcaudit.so
>>>>>>>> Reading libmapmalloc.so.1
>>>>>>>> Reading libgen.so.1
>>>>>>>> Reading libc_psr.so.1
>>>>>>>> Reading rtcboot.so
>>>>>>>> Reading librtc.so
>>>>>>>> Reading libmd_psr.so.1
>>>>>>>> RTC: Enabling Error Checking...
>>>>>>>> RTC: Running program...
>>>>>>>> Write to unallocated (wua) on thread 1:
>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000
>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0
>>>>>>>> 0xffffffff55174da0: _readdir+0x0064:    call
>>>>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80
>>>>>>>> (dbx) where
>>>>>>>> current thread: t@1
>>>>>>>> =>[1] _readdir(0xffffffff79f00300, 0x2e6800, 0x4, 0x2d, 0x4,
>>>>>> 0xffffffff79f00300), at 0xffffffff55174da0
>>>>>>>>  [2] list_files_by_dir(0x100138fd8, 0xffffffff7fffd1f0,
>> 0xffffffff7fffd1e8,
>>>>>> 0xffffffff7fffd210, 0x0, 0xffffffff702a0010), at
>>>>>>>> 0xffffffff63174594
>>>>>>>>  [3] foreachfile_callback(0x100138fd8, 0xffffffff7fffd458, 0x0,
>> 0x2e, 0x0,
>>>>>> 0xffffffff702a0010), at 0xffffffff6317461c
>>>>>>>>  [4] foreach_dirinpath(0x1001d8a28, 0x0, 0xffffffff631745e0,
>>>>>> 0xffffffff7fffd458, 0x0, 0xffffffff702a0010), at 0xffffffff63171684
>>>>>>>>  [5] lt_dlforeachfile(0x1001d8a28, 0xffffffff6319656c, 0x0, 0x53,
>> 0x2f,
>>>>>> 0xf), at 0xffffffff63174748
>>>>>>>>  [6] find_dyn_components(0x0, 0xffffffff6323b570, 0x0, 0x1,
>>>>>> 0xffffffff7fffd6a0, 0xffffffff702a0010), at 0xffffffff63195e38
>>>>>>>>  [7] mca_base_component_find(0x0, 0xffffffff6323b570,
>> 0xffffffff6335e1b0,
>>>>>> 0x0, 0xffffffff7fffd6a0, 0x1), at 0xffffffff631954d8
>>>>>>>>  [8] mca_base_framework_components_register(0xffffffff6335e1c0,
>> 0x0, 0x3e,
>>>>>> 0x0, 0x3b, 0x100800), at 0xffffffff631b1638
>>>>>>>>  [9] mca_base_framework_register(0xffffffff6335e1c0, 0x0, 0x2,
>>>>>> 0xffffffff7fffd8d0, 0x0, 0xffffffff702a0010), at 0xffffffff631b24d4
>>>>>>>>  [10] mca_base_framework_open(0xffffffff6335e1c0, 0x0, 0x2,
>>>>>> 0xffffffff7fffd990, 0x0, 0xffffffff702a0010), at 0xffffffff631b25d0
>>>>>>>>  [11] opal_init(0xffffffff7fffdd70, 0xffffffff7fffdd78, 0x100117c60,
>>>>>> 0xffffffff7fffde58, 0x400, 0x100117c60), at
>>>>>>>> 0xffffffff63153694
>>>>>>>>  [12] orterun(0x4, 0xffffffff7fffde58, 0x2, 0xffffffff7fffdda0, 0x0,
>>>>>> 0xffffffff702a0010), at 0x100005078
>>>>>>>>  [13] main(0x4, 0xffffffff7fffde58, 0xffffffff7fffde80, 0x100117c60,
>>>>>> 0x100000000, 0xffffffff6a700200), at 0x100003d68
>>>>>>>> (dbx)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I get the following output with gdb.
>>>>>>>>
>>>>>>>> tyr small_prog 107 /usr/local/gdb-7.6.1_64_gcc/bin/gdb
>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
>>>>>>>> GNU gdb (GDB) 7.6.1
>>>>>>>> Copyright (C) 2013 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show
>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "sparc-sun-solaris2.10".
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>>>>>> Reading symbols from
>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done.
>>>>>>>> (gdb) run -np 1 a.out
>>>>>>>> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1
>> a.out
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> [New Thread 1 (LWP 1)]
>>>>>>>> [New LWP    2        ]
>>>>>>>> [tyr:27867] *** Process received signal ***
>>>>>>>> [tyr:27867] Signal: Bus Error (10)
>>>>>>>> [tyr:27867] Signal code: Invalid address alignment (1)
>>>>>>>> [tyr:27867] Failing at address: ffffffff7fffd224
>>>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b
>>>>>> acktrace_print+0x2c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfa
>>>>>> 0
>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98
>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c
>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918
>>>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e
>>>>>> e8 [ Signal 10 (BUS)]
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d
>>>>>> b_base_store+0xc8
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u
>>>>>> til_decode_pidmap+0x798
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u
>>>>>> til_nidmap_init+0x3cc
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22
>>>>>> 6c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i
>>>>>> nit+0x308
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in
>>>>>> it+0x31c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0
>>>>>> x2a8
>>>>>>
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c
>>>>>>>> [tyr:27867] *** End of error message ***
>>>>>>>>
>> --------------------------------------------------------------------------
>>>>>>>> mpiexec noticed that process rank 0 with PID 27867 on node tyr
>> exited on
>>>>>> signal 10 (Bus Error).
>> --------------------------------------------------------------------------
>>>>>>>> [LWP    2         exited]
>>>>>>>> [New Thread 2        ]
>>>>>>>> [Switching to Thread 1 (LWP 1)]
>>>>>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be
>> found to
>>>>>> satisfy query
>>>>>>>> (gdb) bt
>>>>>>>> #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from
>>>>>> /usr/lib/sparcv9/ld.so.1
>>>>>>>> #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>>>>>>>> #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>>>>>>>> #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>>>>>>>> #4  0xffffffff7f624574 in remove_hdl () from
>> /usr/lib/sparcv9/ld.so.1
>>>>>>>> #5  0xffffffff7f61d97c in dlclose_core () from
>> /usr/lib/sparcv9/ld.so.1
>>>>>>>> #6  0xffffffff7f61d9d4 in dlclose_intn () from
>> /usr/lib/sparcv9/ld.so.1
>>>>>>>> #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>>>>>>>> #8  0xffffffff7ec7746c in vm_close ()
>>>>>>>>   from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
>>>>>>>> #9  0xffffffff7ec74a4c in lt_dlclose ()
>>>>>>>>   from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
>>>>>>>> #10 0xffffffff7ec99b70 in ri_destructor (obj=0x1001ead30)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:391
>>>>>>>> #11 0xffffffff7ec98488 in opal_obj_run_destructors
>> (object=0x1001ead30)
>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/class/opal_object.h:446
>>>>>>>> #12 0xffffffff7ec993ec in mca_base_component_repository_release (
>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:244
>>>>>>>> #13 0xffffffff7ec9b734 in mca_base_component_unload (
>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>,
>> output_id=-1)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:47
>>>>>>>> #14 0xffffffff7ec9b7c8 in mca_base_component_close (
>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>,
>> output_id=-1)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:60
>>>>>>>> #15 0xffffffff7ec9b89c in mca_base_components_close (output_id=-1,
>>>>>>>>    components=0xffffffff7f12b430 <orte_oob_base_framework+80>,
>> skip=0x0)
>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:86
>>>>>>>> #16 0xffffffff7ec9b804 in mca_base_framework_components_close (
>>>>>>>>    framework=0xffffffff7f12b3e0 <orte_oob_base_framework>, skip=0x0)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:66
>>>>>>>> #17 0xffffffff7efae1e4 in orte_oob_base_close ()
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/orte/mca/oob/base/oob_base_frame.c:94
>>>>>>>> #18 0xffffffff7ecb28ac in mca_base_framework_close (
>>>>>>>>    framework=0xffffffff7f12b3e0 <orte_oob_base_framework>)
>>>>>>>>    at
>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_framework.c:187
>>>>>>>> #19 0xffffffff7bf078c0 in rte_finalize ()
>>>>>>>>    at
>> ../../../../../openmpi-1.8.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:858
>>>>>>>> #20 0xffffffff7ef30a44 in orte_finalize ()
>>>>>>>>    at ../../openmpi-1.8.2rc3/orte/runtime/orte_finalize.c:65
>>>>>>>> #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0e8)
>>>>>>>>    at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/orterun.c:1096
>>>>>>>> #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0e8)
>>>>>>>>    at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/main.c:13
>>>>>>>> (gdb)
>>>>>>>>
>>>>>>>>
>>>>>>>> Is the above information helpful to track down the error? Do you
>> need
>>>>>>>> anything else? Thank you very much for any help in advance.
>>>>>>>>
>>>>>>>>
>>>>>>>> Kind regards
>>>>>>>>
>>>>>>>> Siegmar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jul 25, 2014, at 2:08 AM, Siegmar Gross
>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
>>>>>>>>>>
>>>>>>>>>> tyr hello_1 105 mpiexec -np 2 a.out
>>>>>>>>>> [tyr:29164] *** Process received signal ***
>>>>>>>>>> [tyr:29164] Signal: Bus Error (10)
>>>>>>>>>> [tyr:29164] Signal code: Invalid address alignment (1)
>>>>>>>>>> [tyr:29164] Failing at address: ffffffff7fffd1c4
>>>>>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b
>>>>>> acktrace_print+0x2c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd
>>>>>> 0
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918
>>>>>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e
>>>>>> e8 [ Signal 10 (BUS)]
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d
>>>>>> b_base_store+0xc8
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u
>>>>>> til_decode_pidmap+0x798
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u
>>>>>> til_nidmap_init+0x3cc
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22
>>>>>> 6c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i
>>>>>> nit+0x308
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in
>>>>>> it+0x31c
>>>>>>
>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0
>>>>>> x2a8
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c
>>>>>>>>>> [tyr:29164] *** End of error message ***
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I get the following output if I run the program in "dbx".
>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>> RTC: Enabling Error Checking...
>>>>>>>>>> RTC: Running program...
>>>>>>>>>> Write to unallocated (wua) on thread 1:
>>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000
>>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0
>>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064:    call
>>>>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80
>>>>>>>>>> (dbx)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hopefully the above output helps to fix the error. Can I provide
>>>>>>>>>> anything else? Thank you very much for any help in advance.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Kind regards
>>>>>>>>>>
>>>>>>>>>> Siegmar
>>
>>
>>
>
>