Yes, I know - but the problem comes from nidmap pushing data down into the 
opal_db/dstore level, which then creates a copy of the data. That's where the 
alignment error is generated.


On Aug 8, 2014, at 11:17 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> On Fri, Aug 8, 2014 at 5:21 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Sorry to chime in a little late. George is likely correct about using 
> ORTE_NAME, except you can't do that, as the OPAL layer has no idea what that 
> datatype looks like. This was the original reason for creating the 
> opal_identifier_t type - I had no other choice when we moved the db framework 
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. 
> The abstraction requirement wouldn't allow me to pass down the structure 
> definition.
> 
> We are talking about nidmap.c, which has not yet been moved down to OPAL. 
> 
>   George.
>  
> 
> The easiest solution is probably to change the opal/db/hash code so that 
> 64-bit fields are memcpy'd instead of simply assigned with "=". This should 
> eliminate the problem with the least fuss.
> 
> There is a performance penalty for using non-aligned data, and ideally we 
> should use aligned data whenever possible. This code isn't in the critical 
> path, so this is less of an issue, but it would still be nice to do. However, 
> I didn't do so for the following reasons:
> 
> * I couldn't find a way to make the compiler check/require alignment of a 
> parameter passed down to opal_db.store. If someone knows of a way to do that, 
> please feel free to suggest it.
> 
> * none of our current developers have access to a Solaris SPARC machine, and 
> thus we cannot detect violations when they occur.
> 
> * the current solution avoids the issue, albeit with a slight performance 
> penalty.
> 
> I'm open to alternative methods - I'm not happy with the ugliness this 
> required, but I couldn't come up with a cleaner solution that would make it 
> easy for developers to know when they have violated the alignment requirement.
> 
> FWIW: it is possible, I suppose, that the other discussion about using an 
> opal_process_name_t that exactly mirrors orte_process_name_t could also 
> resolve this problem in a cleaner fashion. I didn't impose that requirement 
> here, but maybe it's another motivator for doing so?
> 
> Ralph
> 
> 
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
>> George,
>> 
>> one of the faulty lines was:
>> 
>>     if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
>>                                             OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
>>                                             (opal_identifier_t*)&proc, OPAL_ID_T))) {
>> 
>> So if proc is not 64-bit aligned, a SIGBUS will occur on SPARC.
>> As you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
>> the issue (I have no such arch to test on...)
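>> 
>> i.e. the call would become something like this (untested sketch, assuming
>> the dstore layer accepts the ORTE_NAME dss type):
>> 
>>     if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
>>                                             OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
>>                                             &proc, ORTE_NAME))) {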
>> 
>> I was initially also "confused" by the following line:
>> 
>>     if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
>>                                             OPAL_SCOPE_INTERNAL,
>>                                             ORTE_DB_NPROC_OFFSET,
>>                                             &offset, OPAL_UINT32))) {
>> 
>> The first argument of store is an (opal_identifier_t *);
>> strictly speaking this is "a pointer to a 64-bit aligned address", and proc
>> might not be 64-bit aligned.
>> /* that being said, there is no crash :-) */
>> 
>> In this case, the opal_db.store pointer points to the store function
>> (db_hash.c:178),
>> and proc is only used in a memcpy at line 194, so 64-bit alignment is not
>> required
>> (and the comment is explicit: /* to protect alignment, copy the data across */)
>> 
>> That might sound pedantic, but are we doing the right thing here?
>> (i.e. cast to (opal_identifier_t *), followed by a memcpy in case the
>> pointer is not 64-bit aligned,
>> vs. always using aligned data?)
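>> 
>> For comparison, the "always use aligned data" variant could look like this
>> (untested sketch, at the cost of one extra copy):
>> 
>>     opal_identifier_t id;            /* a real opal_identifier_t, hence 64-bit aligned */
>>     memcpy(&id, &proc, sizeof(id));  /* byte-wise copy, alignment-safe */
>>     rc = opal_db.store(&id, OPAL_SCOPE_INTERNAL, ORTE_DB_NPROC_OFFSET,
>>                        &offset, OPAL_UINT32);
>> 
>> so the (opal_identifier_t *) argument always points at genuinely aligned
>> storage.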
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue. The current problem
>>> is purely related to the fact that in a single location (nidmap.c) the
>>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>>> aligned based on the uint64_t requirements. Bad assumption!
>>> 
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>> different alignments. However, ORTE defines a DSS type for orte_process_name_t.
>>> Thus, I think that if, instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME, the issue will go away.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>> 
>>>> Kawashima-san and all,
>>>> 
>>>> Here is attached a one-off patch for v1.8.
>>>> /* it does not use the __attribute__ modifier, which might not be
>>>> supported by all compilers */
>>>> 
>>>> As far as I am concerned, the same issue is also in the trunk,
>>>> and if you do not hit it, it just means you are lucky :-)
>>>> 
>>>> The same issue might also be in other parts of the code :-(
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>>>> Gilles, George,
>>>>> 
>>>>> The problem is the one Gilles pointed out.
>>>>> I temporarily modified the code as below and the bus error disappeared.
>>>>> 
>>>>> --- orte/util/nidmap.c  (revision 32447)
>>>>> +++ orte/util/nidmap.c  (working copy)
>>>>> @@ -885,7 +885,7 @@
>>>>>      orte_proc_state_t state;
>>>>>      orte_app_idx_t app_idx;
>>>>>      int32_t restarts;
>>>>> -    orte_process_name_t proc, dmn;
>>>>> +    orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>>>      char *hostname;
>>>>>      uint8_t flag;
>>>>>      opal_buffer_t *bptr;
>>>>> 
>>>>> Takahiro Kawashima,
>>>>> MPI development team,
>>>>> Fujitsu
>>>>> 
>>>>>> Kawashima-san,
>>>>>> 
>>>>>> This is interesting :-)
>>>>>> 
>>>>>> proc is on the stack and has type orte_process_name_t
>>>>>> 
>>>>>> with
>>>>>> 
>>>>>> typedef uint32_t orte_jobid_t;
>>>>>> typedef uint32_t orte_vpid_t;
>>>>>> struct orte_process_name_t {
>>>>>>     orte_jobid_t jobid;     /**< Job number */
>>>>>>     orte_vpid_t vpid;       /**< Process id - equivalent to rank */
>>>>>> };
>>>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>>> 
>>>>>> 
>>>>>> so there is really no reason to align this on 8 bytes...
>>>>>> but later, proc is cast to a uint64_t...
>>>>>> so proc should have been aligned on 8 bytes, but it is too late,
>>>>>> and hence the glorious SIGBUS
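>>>>>> 
>>>>>> e.g. a hypothetical reproduction of the failure mode (names invented):
>>>>>> 
>>>>>> struct { uint32_t pad; orte_process_name_t proc; } s;
>>>>>> /* the inner struct only needs 4-byte alignment, so s.proc can land on a
>>>>>>  * 4-byte boundary; the 8-byte load below then traps on SPARC */
>>>>>> uint64_t v = *(uint64_t *)&s.proc;  /* SIGBUS */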
>>>>>> 
>>>>>> 
>>>>>> this is loosely related to
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>>>>>> (see heterogeneous.v2.patch)
>>>>>> if we make opal_process_name_t a union of a uint64_t and a struct of two
>>>>>> uint32_t, the compiler will align it on 8 bytes.
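>>>>>> 
>>>>>> e.g. a sketch of the idea (not the actual patch; field names borrowed
>>>>>> from orte_process_name_t):
>>>>>> 
>>>>>> typedef union {
>>>>>>     uint64_t opaque;        /* forces 8-byte alignment of the union */
>>>>>>     struct {
>>>>>>         uint32_t jobid;
>>>>>>         uint32_t vpid;
>>>>>>     } name;
>>>>>> } opal_process_name_t;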
>>>>>> Note the patch is not enough (and will not apply on the v1.8 branch anyway);
>>>>>> we could simply remove orte_process_name_t and ompi_process_name_t and use
>>>>>> only opal_process_name_t (and never declare variables with type
>>>>>> opal_proc_name_t, otherwise alignment might be incorrect).
>>>>>> 
>>>>>> as a workaround, you can declare an opal_process_name_t (for alignment),
>>>>>> and cast it to an orte_process_name_t
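>>>>>> 
>>>>>> i.e. something like (untested):
>>>>>> 
>>>>>> opal_process_name_t tmp;    /* 8-byte aligned storage */
>>>>>> orte_process_name_t *proc = (orte_process_name_t *)&tmp;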
>>>>>> 
>>>>>> I will write a patch (I will not be able to test on SPARC...).
>>>>>> Please note this issue might be present in other places.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>>>> 10 Sparc and I receive a bus error if I run a small program.
>>>>>>> I've finally reproduced the bus error in my SPARC environment.
>>>>>>> 
>>>>>>> #0 0xffffffff00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xfffff80100064af0,0x35b4)
>>>>>>> #1 0xffffffff0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x000007feffffd100,p=(void *) 0x000007feffffd100) at line 277 in ../sigattach.c <SIGNAL HANDLER>
>>>>>>> #2 0xffffffff0282aff4 (store + 0x540) (uid=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",data=(void *) 0x000007feffffde74,type=15:'\017') at line 252 in db_hash.c
>>>>>>> #3 0xffffffff01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",object=(void *) 0x000007feffffde74,type=15:'\017') at line 49 in db_base_fns.c
>>>>>>> #4 0xffffffff00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x0000000000281d70) at line 975 in nidmap.c
>>>>>>> #5 0xffffffff00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x0000000000241fc0) at line 141 in nidmap.c
>>>>>>> #6 0xffffffff01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
>>>>>>> #7 0xffffffff00f9f28c (orte_init + 0x308) (pargc=(int *) 0x0000000000000000,pargv=(char ***) 0x0000000000000000,flags=32) at line 148 in orte_init.c
>>>>>>> #8 0xffffffff001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x000007fefffff348,requested=0,provided=(int *) 0x000007feffffe698) at line 464 in ompi_mpi_init.c
>>>>>>> #9 0xffffffff001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x000007feffffe814,argv=(char ***) 0x000007feffffe818) at line 84 in init.c
>>>>>>> #10 0x0000000000100ae4 (main + 0x44) (argc=1,argv=(char **) 0x000007fefffff348) at line 8 in mpiinitfinalize.c
>>>>>>> #11 0xffffffff00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fefffff348,0x100d24,0x100d14,0x0)
>>>>>>> #12 0x000000000010094c (_start + 0x2c) ()
>>>>>>> 
>>>>>>> The line 252 in opal/mca/db/hash/db_hash.c is:
>>>>>>> 
>>>>>>>     case OPAL_UINT64:
>>>>>>>         if (NULL == data) {
>>>>>>>             OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>>>>>>             return OPAL_ERR_BAD_PARAM;
>>>>>>>         }
>>>>>>>         kv->type = OPAL_UINT64;
>>>>>>>         kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
>>>>>>>         break;
>>>>>>> 
>>>>>>> My environment is:
>>>>>>> 
>>>>>>>   Open MPI v1.8 branch r32447 (latest)
>>>>>>>   configure --enable-debug
>>>>>>>   SPARC-V9 (Fujitsu SPARC64 IXfx)
>>>>>>>   Linux (custom)
>>>>>>>   gcc 4.2.4
>>>>>>> 
>>>>>>> I could not reproduce it with the Open MPI trunk or with the Fujitsu compiler.
>>>>>>> 
>>>>>>> Can this information help?
>>>>>>> 
>>>>>>> Takahiro Kawashima,
>>>>>>> MPI development team,
>>>>>>> Fujitsu
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I'm sorry once more to answer late, but for the last two days our mail
>>>>>>>> server was down (hardware error).
>>>>>>>> 
>>>>>>>>> Did you configure this --enable-debug?
>>>>>>>> Yes, I used the following command.
>>>>>>>> 
>>>>>>>> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>>>>>>>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>>>>>>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>>>>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>>>>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>>>>>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>>>>>>>   CC="gcc" CXX="g++" FC="gfortran" \
>>>>>>>>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>>>>>   CPP="cpp" CXXCPP="cpp" \
>>>>>>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>>>>>>   --enable-mpi-cxx \
>>>>>>>>   --enable-cxx-exceptions \
>>>>>>>>   --enable-mpi-java \
>>>>>>>>   --enable-heterogeneous \
>>>>>>>>   --enable-mpi-thread-multiple \
>>>>>>>>   --with-threads=posix \
>>>>>>>>   --with-hwloc=internal \
>>>>>>>>   --without-verbs \
>>>>>>>>   --with-wrapper-cflags="-std=c11 -m64" \
>>>>>>>>   --enable-debug \
>>>>>>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> If so, you should get a line number in the backtrace
>>>>>>>> I got line numbers with gdb (see below), but not with "dbx".
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Kind regards
>>>>>>>> 
>>>>>>>> Siegmar
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 5, 2014, at 2:59 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I'm sorry to answer so late, but last week I didn't have Internet
>>>>>>>>>> access. In the meantime I've installed openmpi-1.8.2rc3 and I get
>>>>>>>>>> the same error.
>>>>>>>>>> 
>>>>>>>>>>> This looks like the typical type of alignment error that we used
>>>>>>>>>>> to see when testing regularly on SPARC.  :-\
>>>>>>>>>>> 
>>>>>>>>>>> It looks like the error was happening in mca_db_hash.so.  Could
>>>>>>>>>>> you get a stack trace / file+line number where it was failing
>>>>>>>>>>> in mca_db_hash?  (i.e., the actual bad code will likely be under
>>>>>>>>>>> opal/mca/db/hash somewhere)
>>>>>>>>>> Unfortunately I don't get a file+line number from a file in
>>>>>>>>>> opal/mca/db/hash.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> tyr small_prog 102 ompi_info | grep MPI:
>>>>>>>>>>                Open MPI: 1.8.2rc3
>>>>>>>>>> tyr small_prog 103 which mpicc
>>>>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
>>>>>>>>>> tyr small_prog 104 mpicc init_finalize.c
>>>>>>>>>> tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
>>>>>>>>>> For information about new features see `help changes'
>>>>>>>>>> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
>>>>>>>>>> Reading mpiexec
>>>>>>>>>> Reading ld.so.1
>>>>>>>>>> Reading libopen-rte.so.7.0.4
>>>>>>>>>> Reading libopen-pal.so.6.2.0
>>>>>>>>>> Reading libsendfile.so.1
>>>>>>>>>> Reading libpicl.so.1
>>>>>>>>>> Reading libkstat.so.1
>>>>>>>>>> Reading liblgrp.so.1
>>>>>>>>>> Reading libsocket.so.1
>>>>>>>>>> Reading libnsl.so.1
>>>>>>>>>> Reading libgcc_s.so.1
>>>>>>>>>> Reading librt.so.1
>>>>>>>>>> Reading libm.so.2
>>>>>>>>>> Reading libpthread.so.1
>>>>>>>>>> Reading libc.so.1
>>>>>>>>>> Reading libdoor.so.1
>>>>>>>>>> Reading libaio.so.1
>>>>>>>>>> Reading libmd.so.1
>>>>>>>>>> (dbx) check -all
>>>>>>>>>> access checking - ON
>>>>>>>>>> memuse checking - ON
>>>>>>>>>> (dbx) run -np 1 a.out
>>>>>>>>>> Running: mpiexec -np 1 a.out
>>>>>>>>>> (process id 27833)
>>>>>>>>>> Reading rtcapihook.so
>>>>>>>>>> Reading libdl.so.1
>>>>>>>>>> Reading rtcaudit.so
>>>>>>>>>> Reading libmapmalloc.so.1
>>>>>>>>>> Reading libgen.so.1
>>>>>>>>>> Reading libc_psr.so.1
>>>>>>>>>> Reading rtcboot.so
>>>>>>>>>> Reading librtc.so
>>>>>>>>>> Reading libmd_psr.so.1
>>>>>>>>>> RTC: Enabling Error Checking...
>>>>>>>>>> RTC: Running program...
>>>>>>>>>> Write to unallocated (wua) on thread 1:
>>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000
>>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0
>>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064:    call _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80
>>>>>>>>>> (dbx) where
>>>>>>>>>> current thread: t@1
>>>>>>>>>> =>[1] _readdir(0xffffffff79f00300, 0x2e6800, 0x4, 0x2d, 0x4, 0xffffffff79f00300), at 0xffffffff55174da0
>>>>>>>>>>   [2] list_files_by_dir(0x100138fd8, 0xffffffff7fffd1f0, 0xffffffff7fffd1e8, 0xffffffff7fffd210, 0x0, 0xffffffff702a0010), at 0xffffffff63174594
>>>>>>>>>>   [3] foreachfile_callback(0x100138fd8, 0xffffffff7fffd458, 0x0, 0x2e, 0x0, 0xffffffff702a0010), at 0xffffffff6317461c
>>>>>>>>>>   [4] foreach_dirinpath(0x1001d8a28, 0x0, 0xffffffff631745e0, 0xffffffff7fffd458, 0x0, 0xffffffff702a0010), at 0xffffffff63171684
>>>>>>>>>>   [5] lt_dlforeachfile(0x1001d8a28, 0xffffffff6319656c, 0x0, 0x53, 0x2f, 0xf), at 0xffffffff63174748
>>>>>>>>>>   [6] find_dyn_components(0x0, 0xffffffff6323b570, 0x0, 0x1, 0xffffffff7fffd6a0, 0xffffffff702a0010), at 0xffffffff63195e38
>>>>>>>>>>   [7] mca_base_component_find(0x0, 0xffffffff6323b570, 0xffffffff6335e1b0, 0x0, 0xffffffff7fffd6a0, 0x1), at 0xffffffff631954d8
>>>>>>>>>>   [8] mca_base_framework_components_register(0xffffffff6335e1c0, 0x0, 0x3e, 0x0, 0x3b, 0x100800), at 0xffffffff631b1638
>>>>>>>>>>   [9] mca_base_framework_register(0xffffffff6335e1c0, 0x0, 0x2, 0xffffffff7fffd8d0, 0x0, 0xffffffff702a0010), at 0xffffffff631b24d4
>>>>>>>>>>   [10] mca_base_framework_open(0xffffffff6335e1c0, 0x0, 0x2, 0xffffffff7fffd990, 0x0, 0xffffffff702a0010), at 0xffffffff631b25d0
>>>>>>>>>>   [11] opal_init(0xffffffff7fffdd70, 0xffffffff7fffdd78, 0x100117c60, 0xffffffff7fffde58, 0x400, 0x100117c60), at 0xffffffff63153694
>>>>>>>>>>   [12] orterun(0x4, 0xffffffff7fffde58, 0x2, 0xffffffff7fffdda0, 0x0, 0xffffffff702a0010), at 0x100005078
>>>>>>>>>>   [13] main(0x4, 0xffffffff7fffde58, 0xffffffff7fffde80, 0x100117c60, 0x100000000, 0xffffffff6a700200), at 0x100003d68
>>>>>>>>>> (dbx)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I get the following output with gdb.
>>>>>>>>>> 
>>>>>>>>>> tyr small_prog 107 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
>>>>>>>>>> GNU gdb (GDB) 7.6.1
>>>>>>>>>> Copyright (C) 2013 Free Software Foundation, Inc.
>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>> This GDB was configured as "sparc-sun-solaris2.10".
>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>>>>>>>> Reading symbols from /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done.
>>>>>>>>>> (gdb) run -np 1 a.out
>>>>>>>>>> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 a.out
>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>> [New Thread 1 (LWP 1)]
>>>>>>>>>> [New LWP    2        ]
>>>>>>>>>> [tyr:27867] *** Process received signal ***
>>>>>>>>>> [tyr:27867] Signal: Bus Error (10)
>>>>>>>>>> [tyr:27867] Signal code: Invalid address alignment (1)
>>>>>>>>>> [tyr:27867] Failing at address: ffffffff7fffd224
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfa0
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c
>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8 [ Signal 10 (BUS)]
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0x2a8
>>>>>>>>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20
>>>>>>>>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c
>>>>>>>>>> [tyr:27867] *** End of error message ***
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpiexec noticed that process rank 0 with PID 27867 on node tyr exited on
>>>>>>>>>> signal 10 (Bus Error).
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> [LWP    2         exited]
>>>>>>>>>> [New Thread 2        ]
>>>>>>>>>> [Switching to Thread 1 (LWP 1)]
>>>>>>>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to
>>>>>>>>>> satisfy query
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>>>>>>>>>> #8  0xffffffff7ec7746c in vm_close ()
>>>>>>>>>>    from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
>>>>>>>>>> #9  0xffffffff7ec74a4c in lt_dlclose ()
>>>>>>>>>>    from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
>>>>>>>>>> #10 0xffffffff7ec99b70 in ri_destructor (obj=0x1001ead30)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:391
>>>>>>>>>> #11 0xffffffff7ec98488 in opal_obj_run_destructors (object=0x1001ead30)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/class/opal_object.h:446
>>>>>>>>>> #12 0xffffffff7ec993ec in mca_base_component_repository_release (
>>>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:244
>>>>>>>>>> #13 0xffffffff7ec9b734 in mca_base_component_unload (
>>>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>, output_id=-1)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:47
>>>>>>>>>> #14 0xffffffff7ec9b7c8 in mca_base_component_close (
>>>>>>>>>>    component=0xffffffff7b023cf0 <mca_oob_tcp_component>, output_id=-1)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:60
>>>>>>>>>> #15 0xffffffff7ec9b89c in mca_base_components_close (output_id=-1,
>>>>>>>>>>    components=0xffffffff7f12b430 <orte_oob_base_framework+80>, skip=0x0)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:86
>>>>>>>>>> #16 0xffffffff7ec9b804 in mca_base_framework_components_close (
>>>>>>>>>>    framework=0xffffffff7f12b3e0 <orte_oob_base_framework>, skip=0x0)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:66
>>>>>>>>>> #17 0xffffffff7efae1e4 in orte_oob_base_close ()
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/orte/mca/oob/base/oob_base_frame.c:94
>>>>>>>>>> #18 0xffffffff7ecb28ac in mca_base_framework_close (
>>>>>>>>>>    framework=0xffffffff7f12b3e0 <orte_oob_base_framework>)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_framework.c:187
>>>>>>>>>> #19 0xffffffff7bf078c0 in rte_finalize ()
>>>>>>>>>>    at ../../../../../openmpi-1.8.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:858
>>>>>>>>>> #20 0xffffffff7ef30a44 in orte_finalize ()
>>>>>>>>>>    at ../../openmpi-1.8.2rc3/orte/runtime/orte_finalize.c:65
>>>>>>>>>> #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0e8)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/orterun.c:1096
>>>>>>>>>> #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0e8)
>>>>>>>>>>    at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/main.c:13
>>>>>>>>>> (gdb)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Is the above information helpful to track down the error? Do you need
>>>>>>>>>> anything else? Thank you very much for any help in advance.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Kind regards
>>>>>>>>>> 
>>>>>>>>>> Siegmar
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jul 25, 2014, at 2:08 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>>>> 10 Sparc and I receive a bus error if I run a small program.
>>>>>>>>>>>> 
>>>>>>>>>>>> tyr hello_1 105 mpiexec -np 2 a.out
>>>>>>>>>>>> [tyr:29164] *** Process received signal ***
>>>>>>>>>>>> [tyr:29164] Signal: Bus Error (10)
>>>>>>>>>>>> [tyr:29164] Signal code: Invalid address alignment (1)
>>>>>>>>>>>> [tyr:29164] Failing at address: ffffffff7fffd1c4
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd0
>>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98
>>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c
>>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8 [ Signal 10 (BUS)]
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
>>>>>>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0x2a8
>>>>>>>>>>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20
>>>>>>>>>>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c
>>>>>>>>>>>> [tyr:29164] *** End of error message ***
>>>>>>>>>>>> ...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I get the following output if I run the program in "dbx".
>>>>>>>>>>>> 
>>>>>>>>>>>> ...
>>>>>>>>>>>> RTC: Enabling Error Checking...
>>>>>>>>>>>> RTC: Running program...
>>>>>>>>>>>> Write to unallocated (wua) on thread 1:
>>>>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000
>>>>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0
>>>>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064:    call _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80
>>>>>>>>>>>> (dbx)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hopefully the above output helps to fix the error. Can I provide
>>>>>>>>>>>> anything else? Thank you very much for any help in advance.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Kind regards
>>>>>>>>>>>> 
>>>>>>>>>>>> Siegmar
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
