Committed a fix for this in r32459 - please check and see if this resolves the issue.
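For anyone checking the r32459 change against the thread quoted below, the failure mode is an unaligned 64-bit load: orte_process_name_t is two uint32_t fields, so the compiler only guarantees 4-byte alignment, yet the value was being read through a uint64_t pointer. Here is a minimal, self-contained sketch of the memcpy-based workaround Ralph describes below (simplified names and signatures; purely illustrative, not the actual opal/db/hash code and not necessarily what r32459 does):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Copy a 64-bit value out of a buffer that may only be 4-byte aligned.
     * Dereferencing a misaligned uint64_t* raises SIGBUS on SPARC, so the
     * value is assembled with memcpy instead of "=". */
    static uint64_t load_u64_unaligned(const void *data)
    {
        uint64_t value;
        memcpy(&value, data, sizeof(value));   /* safe for any alignment */
        return value;
    }

    int main(void)
    {
        /* Two 32-bit fields, like orte_process_name_t: only 4-byte
         * alignment is guaranteed for this object. */
        struct { uint32_t jobid; uint32_t vpid; } proc = { 1, 2 };

        /* A runtime version of the check Ralph wishes the compiler could
         * enforce: is this address suitable for a direct 64-bit load? */
        printf("8-byte aligned: %s\n",
               ((uintptr_t)&proc % 8u) == 0 ? "yes" : "no");

        /* Unsafe on SPARC whenever the answer above is "no":
         *     uint64_t v = *(const uint64_t *)&proc;
         * Safe regardless of alignment: */
        uint64_t v = load_u64_unaligned(&proc);
        printf("packed name: 0x%016llx\n", (unsigned long long)v);
        return 0;
    }

The commented-out direct dereference is the same pattern that faults at db_hash.c:252 in the backtrace quoted further down.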
On Aug 8, 2014, at 2:21 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Sorry to chime in a little late. George is likely correct about using
> ORTE_NAME, only you can't do that, as the OPAL layer has no idea what that
> datatype looks like. This was the original reason for creating the
> opal_identifier_t type - I had no other choice when we moved the db framework
> (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL.
> The abstraction requirement wouldn't allow me to pass down the structure
> definition.
>
> The easiest solution is probably to change the opal/db/hash code so that
> 64-bit fields are memcpy'd instead of simply passed by "=". This should
> eliminate the problem with the least fuss.
>
> There is a performance penalty for using non-aligned data, and ideally we
> should use aligned data whenever possible. This code isn't in the critical
> path, so this is less of an issue, but it would still be nice to do. However,
> I didn't do so for the following reasons:
>
> * I couldn't find a way for the compiler to check/require alignment down in
>   opal_db.store when passed a parameter. If someone knows of a way to do that,
>   please feel free to suggest it.
>
> * None of our current developers have access to a Solaris SPARC machine, and
>   thus our developers cannot detect violations when they occur.
>
> * The current solution avoids the issue, albeit with a slight performance
>   penalty.
>
> I'm open to alternative methods - I'm not happy with the ugliness this
> required, but couldn't come up with a cleaner solution that would make it easy
> for developers to know when they violated the alignment requirement.
>
> FWIW: it is possible, I suppose, that the other discussion about using an
> opal_process_name_t that exactly mirrors orte_process_name_t could also
> resolve this problem in a cleaner fashion. I didn't impose that requirement
> here, but maybe it's another motivator for doing so?
>
> Ralph
>
>
> On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
>> George,
>>
>> One of the faulty lines was:
>>
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME,
>>                                          OPAL_SCOPE_INTERNAL, OPAL_DB_LOCALLDR,
>>                                          (opal_identifier_t*)&proc, OPAL_ID_T))) {
>>
>> So if proc is not 64-bit aligned, a SIGBUS will occur on SPARC.
>> As you pointed out, replacing OPAL_ID_T with ORTE_NAME will very likely fix
>> the issue (I have no arch to test on...).
>>
>> I was initially also "confused" by the following line:
>>
>> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
>>                                          OPAL_SCOPE_INTERNAL, ORTE_DB_NPROC_OFFSET,
>>                                          &offset, OPAL_UINT32))) {
>>
>> The first argument of store is an (opal_identifier_t *);
>> strictly speaking this is "a pointer to a 64-bit aligned address", and proc
>> might not be 64-bit aligned.
>> /* that being said, there is no crash :-) */
>>
>> In this case, the opal_db.store pointer points to the store function
>> (db_hash.c:178), and proc is only used in a memcpy at line 194, so 64-bit
>> alignment is not required
>> (and the comment is explicit: /* to protect alignment, copy the data across */).
>>
>> That might sound pedantic, but are we doing the right thing here?
>> (e.g. cast to (opal_identifier_t *), followed by a memcpy in case the
>> pointer was not 64-bit aligned, vs. always using aligned data?)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 14:58, George Bosilca wrote:
>>> This is a gigantic patch for an almost trivial issue.
>>> The current problem is purely related to the fact that in a single
>>> location (nidmap.c) the orte_process_name_t (which is a structure of 2
>>> integers) is supposed to be aligned based on the uint64_t requirements.
>>> Bad assumption!
>>>
>>> Looking at the code one might notice that the orte_process_name_t is stored
>>> using a particular DSS type, OPAL_ID_T. This is a shortcut that doesn't hold
>>> on the SPARC architecture because the two types (int32_t and int64_t) have
>>> different alignments. However, ORTE defines a type for orte_process_name_t.
>>> Thus, I think that if instead of saving the orte_process_name_t as an
>>> OPAL_ID_T, we save it as an ORTE_NAME, the issue will go away.
>>>
>>> George.
>>>
>>>
>>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Kawashima-san and all,
>>>>
>>>> Here is attached a one-off patch for v1.8.
>>>> /* it does not use the __attribute__ modifier that might not be
>>>>    supported by all compilers */
>>>>
>>>> As far as I am concerned, the same issue is also in the trunk,
>>>> and if you do not hit it, it just means you are lucky :-)
>>>>
>>>> The same issue might also be in other parts of the code :-(
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>>>> Gilles, George,
>>>>>
>>>>> The problem is the one Gilles pointed out.
>>>>> I temporarily modified the code below and the bus error disappeared.
>>>>>
>>>>> --- orte/util/nidmap.c  (revision 32447)
>>>>> +++ orte/util/nidmap.c  (working copy)
>>>>> @@ -885,7 +885,7 @@
>>>>>      orte_proc_state_t state;
>>>>>      orte_app_idx_t app_idx;
>>>>>      int32_t restarts;
>>>>> -    orte_process_name_t proc, dmn;
>>>>> +    orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>>>      char *hostname;
>>>>>      uint8_t flag;
>>>>>      opal_buffer_t *bptr;
>>>>>
>>>>> Takahiro Kawashima,
>>>>> MPI development team,
>>>>> Fujitsu
>>>>>
>>>>>> Kawashima-san,
>>>>>>
>>>>>> This is interesting :-)
>>>>>>
>>>>>> proc is on the stack and has type orte_process_name_t,
>>>>>>
>>>>>> with
>>>>>>
>>>>>> typedef uint32_t orte_jobid_t;
>>>>>> typedef uint32_t orte_vpid_t;
>>>>>> struct orte_process_name_t {
>>>>>>     orte_jobid_t jobid;   /**< Job number */
>>>>>>     orte_vpid_t vpid;     /**< Process id - equivalent to rank */
>>>>>> };
>>>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>>>
>>>>>> so there is really no reason to align this on 8 bytes...
>>>>>> but later, proc is cast to a uint64_t...
>>>>>> so proc should have been aligned on 8 bytes, but it is too late,
>>>>>> and hence the glorious SIGBUS.
>>>>>>
>>>>>> This is loosely related to
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>>>>>> (see heterogeneous.v2.patch):
>>>>>> if we make opal_process_name_t a union of a uint64_t and a struct of two
>>>>>> uint32_t, the compiler will align this on 8 bytes.
>>>>>> Note the patch is not enough (and will not apply on the v1.8 branch anyway);
>>>>>> we could simply remove orte_process_name_t and ompi_process_name_t and
>>>>>> use only opal_process_name_t (and never declare variables with type
>>>>>> opal_proc_name_t, otherwise alignment might be incorrect).
>>>>>>
>>>>>> As a workaround, you can declare an opal_process_name_t (for alignment)
>>>>>> and cast it to an orte_process_name_t.
>>>>>>
>>>>>> I will write a patch (I will not be able to test on sparc ...)
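As an aside, a minimal sketch of the union approach Gilles describes just above (hypothetical names, not the actual opal_process_name_t definition): because the union contains a uint64_t member, the compiler gives every instance 8-byte alignment, so viewing the two 32-bit fields as one 64-bit value is safe even on alignment-strict CPUs such as SPARC.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical illustration of the union trick: the uint64_t member
     * forces 8-byte alignment of the whole object. */
    typedef union {
        uint64_t opaque;              /* forces 8-byte alignment */
        struct {
            uint32_t jobid;
            uint32_t vpid;
        } name;
    } example_process_name_t;

    int main(void)
    {
        example_process_name_t n;
        n.name.jobid = 42;
        n.name.vpid  = 7;

        /* _Alignof is C11; on typical 64-bit ABIs this prints 8. */
        printf("alignment = %zu, as uint64 = 0x%016llx\n",
               (size_t)_Alignof(example_process_name_t),
               (unsigned long long)n.opaque);
        return 0;
    }

The benefit only materializes if variables are declared with the union type itself, which is the point about never declaring plain two-uint32_t structs that later get cast to a 64-bit identifier.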
>>>>>> Please note this issue might be present in other places.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
>>>>>>>
>>>>>>> I've finally reproduced the bus error in my SPARC environment.
>>>>>>>
>>>>>>> #0  0xffffffff00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xfffff80100064af0,0x35b4)
>>>>>>> #1  0xffffffff0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x000007feffffd100,p=(void *) 0x000007feffffd100) at line 277 in ../sigattach.c <SIGNAL HANDLER>
>>>>>>> #2  0xffffffff0282aff4 (store + 0x540) (uid=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",data=(void *) 0x000007feffffde74,type=15:'\017') at line 252 in db_hash.c
>>>>>>> #3  0xffffffff01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",object=(void *) 0x000007feffffde74,type=15:'\017') at line 49 in db_base_fns.c
>>>>>>> #4  0xffffffff00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x0000000000281d70) at line 975 in nidmap.c
>>>>>>> #5  0xffffffff00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x0000000000241fc0) at line 141 in nidmap.c
>>>>>>> #6  0xffffffff01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
>>>>>>> #7  0xffffffff00f9f28c (orte_init + 0x308) (pargc=(int *) 0x0000000000000000,pargv=(char ***) 0x0000000000000000,flags=32) at line 148 in orte_init.c
>>>>>>> #8  0xffffffff001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x000007fefffff348,requested=0,provided=(int *) 0x000007feffffe698) at line 464 in ompi_mpi_init.c
>>>>>>> #9  0xffffffff001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x000007feffffe814,argv=(char ***) 0x000007feffffe818) at line 84 in init.c
>>>>>>> #10 0x0000000000100ae4 (main + 0x44) (argc=1,argv=(char **) 0x000007fefffff348) at line 8 in mpiinitfinalize.c
>>>>>>> #11 0xffffffff00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fefffff348,0x100d24,0x100d14,0x0)
>>>>>>> #12 0x000000000010094c (_start + 0x2c) ()
>>>>>>>
>>>>>>> Line 252 in opal/mca/db/hash/db_hash.c is:
>>>>>>>
>>>>>>>     case OPAL_UINT64:
>>>>>>>         if (NULL == data) {
>>>>>>>             OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>>>>>>             return OPAL_ERR_BAD_PARAM;
>>>>>>>         }
>>>>>>>         kv->type = OPAL_UINT64;
>>>>>>>         kv->data.uint64 = *(uint64_t*)(data);   // !!! here !!!
>>>>>>>         break;
>>>>>>>
>>>>>>> My environment is:
>>>>>>>
>>>>>>> Open MPI v1.8 branch r32447 (latest)
>>>>>>> configure --enable-debug
>>>>>>> SPARC-V9 (Fujitsu SPARC64 IXfx)
>>>>>>> Linux (custom)
>>>>>>> gcc 4.2.4
>>>>>>>
>>>>>>> I could not reproduce it with the Open MPI trunk nor with the Fujitsu compiler.
>>>>>>>
>>>>>>> Can this information help?
>>>>>>>
>>>>>>> Takahiro Kawashima,
>>>>>>> MPI development team,
>>>>>>> Fujitsu
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm sorry once more to answer late, but the last two days our mail
>>>>>>>> server was down (hardware error).
>>>>>>>>
>>>>>>>>> Did you configure this --enable-debug?
>>>>>>>>
>>>>>>>> Yes, I used the following command.
>>>>>>>>
>>>>>>>> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>>>>>>>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>>>>>>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>>>>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>>>>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>>>>>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>>>>>>>   CC="gcc" CXX="g++" FC="gfortran" \
>>>>>>>>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>>>>>   CPP="cpp" CXXCPP="cpp" \
>>>>>>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>>>>>>   --enable-mpi-cxx \
>>>>>>>>   --enable-cxx-exceptions \
>>>>>>>>   --enable-mpi-java \
>>>>>>>>   --enable-heterogeneous \
>>>>>>>>   --enable-mpi-thread-multiple \
>>>>>>>>   --with-threads=posix \
>>>>>>>>   --with-hwloc=internal \
>>>>>>>>   --without-verbs \
>>>>>>>>   --with-wrapper-cflags="-std=c11 -m64" \
>>>>>>>>   --enable-debug \
>>>>>>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
>>>>>>>>
>>>>>>>>
>>>>>>>>> If so, you should get a line number in the backtrace.
>>>>>>>>
>>>>>>>> I got them with gdb (see below), but not with "dbx".
>>>>>>>>
>>>>>>>> Kind regards
>>>>>>>>
>>>>>>>> Siegmar
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Aug 5, 2014, at 2:59 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm sorry to answer so late, but last week I didn't have Internet
>>>>>>>>>> access. In the meantime I've installed openmpi-1.8.2rc3 and I get
>>>>>>>>>> the same error.
>>>>>>>>>>
>>>>>>>>>>> This looks like the typical type of alignment error that we used
>>>>>>>>>>> to see when testing regularly on SPARC. :-\
>>>>>>>>>>>
>>>>>>>>>>> It looks like the error was happening in mca_db_hash.so. Could
>>>>>>>>>>> you get a stack trace / file+line number where it was failing
>>>>>>>>>>> in mca_db_hash? (i.e., the actual bad code will likely be under
>>>>>>>>>>> opal/mca/db/hash somewhere)
>>>>>>>>>>
>>>>>>>>>> Unfortunately I don't get a file+line number from a file in
>>>>>>>>>> opal/mca/db/hash.
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> tyr small_prog 102 ompi_info | grep MPI: >>>>>>>>>> Open MPI: 1.8.2rc3 >>>>>>>>>> tyr small_prog 103 which mpicc >>>>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc >>>>>>>>>> tyr small_prog 104 mpicc init_finalize.c >>>>>>>>>> tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx >>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec >>>>>>>>>> For information about new features see `help changes' >>>>>>>>>> To remove this message, put `dbxenv suppress_startup_message 7.9' >>>> in your >>>>>>>> .dbxrc >>>>>>>>>> Reading mpiexec >>>>>>>>>> Reading ld.so.1 >>>>>>>>>> Reading libopen-rte.so.7.0.4 >>>>>>>>>> Reading libopen-pal.so.6.2.0 >>>>>>>>>> Reading libsendfile.so.1 >>>>>>>>>> Reading libpicl.so.1 >>>>>>>>>> Reading libkstat.so.1 >>>>>>>>>> Reading liblgrp.so.1 >>>>>>>>>> Reading libsocket.so.1 >>>>>>>>>> Reading libnsl.so.1 >>>>>>>>>> Reading libgcc_s.so.1 >>>>>>>>>> Reading librt.so.1 >>>>>>>>>> Reading libm.so.2 >>>>>>>>>> Reading libpthread.so.1 >>>>>>>>>> Reading libc.so.1 >>>>>>>>>> Reading libdoor.so.1 >>>>>>>>>> Reading libaio.so.1 >>>>>>>>>> Reading libmd.so.1 >>>>>>>>>> (dbx) check -all >>>>>>>>>> access checking - ON >>>>>>>>>> memuse checking - ON >>>>>>>>>> (dbx) run -np 1 a.outRunning: mpiexec -np 1 a.out >>>>>>>>>> (process id 27833) >>>>>>>>>> Reading rtcapihook.so >>>>>>>>>> Reading libdl.so.1 >>>>>>>>>> Reading rtcaudit.so >>>>>>>>>> Reading libmapmalloc.so.1 >>>>>>>>>> Reading libgen.so.1 >>>>>>>>>> Reading libc_psr.so.1 >>>>>>>>>> Reading rtcboot.so >>>>>>>>>> Reading librtc.so >>>>>>>>>> Reading libmd_psr.so.1 >>>>>>>>>> RTC: Enabling Error Checking... >>>>>>>>>> RTC: Running program... >>>>>>>>>> Write to unallocated (wua) on thread 1: >>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000 >>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0 >>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064: call >>>>>>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 
0xffffffff55342a80 >>>>>>>>>> (dbx) where >>>>>>>>>> current thread: t@1 >>>>>>>>>> =>[1] _readdir(0xffffffff79f00300, 0x2e6800, 0x4, 0x2d, 0x4, >>>>>>>> 0xffffffff79f00300), at 0xffffffff55174da0 >>>>>>>>>> [2] list_files_by_dir(0x100138fd8, 0xffffffff7fffd1f0, >>>> 0xffffffff7fffd1e8, >>>>>>>> 0xffffffff7fffd210, 0x0, 0xffffffff702a0010), at >>>>>>>>>> 0xffffffff63174594 >>>>>>>>>> [3] foreachfile_callback(0x100138fd8, 0xffffffff7fffd458, 0x0, >>>> 0x2e, 0x0, >>>>>>>> 0xffffffff702a0010), at 0xffffffff6317461c >>>>>>>>>> [4] foreach_dirinpath(0x1001d8a28, 0x0, 0xffffffff631745e0, >>>>>>>> 0xffffffff7fffd458, 0x0, 0xffffffff702a0010), at 0xffffffff63171684 >>>>>>>>>> [5] lt_dlforeachfile(0x1001d8a28, 0xffffffff6319656c, 0x0, 0x53, >>>> 0x2f, >>>>>>>> 0xf), at 0xffffffff63174748 >>>>>>>>>> [6] find_dyn_components(0x0, 0xffffffff6323b570, 0x0, 0x1, >>>>>>>> 0xffffffff7fffd6a0, 0xffffffff702a0010), at 0xffffffff63195e38 >>>>>>>>>> [7] mca_base_component_find(0x0, 0xffffffff6323b570, >>>> 0xffffffff6335e1b0, >>>>>>>> 0x0, 0xffffffff7fffd6a0, 0x1), at 0xffffffff631954d8 >>>>>>>>>> [8] mca_base_framework_components_register(0xffffffff6335e1c0, >>>> 0x0, 0x3e, >>>>>>>> 0x0, 0x3b, 0x100800), at 0xffffffff631b1638 >>>>>>>>>> [9] mca_base_framework_register(0xffffffff6335e1c0, 0x0, 0x2, >>>>>>>> 0xffffffff7fffd8d0, 0x0, 0xffffffff702a0010), at 0xffffffff631b24d4 >>>>>>>>>> [10] mca_base_framework_open(0xffffffff6335e1c0, 0x0, 0x2, >>>>>>>> 0xffffffff7fffd990, 0x0, 0xffffffff702a0010), at 0xffffffff631b25d0 >>>>>>>>>> [11] opal_init(0xffffffff7fffdd70, 0xffffffff7fffdd78, 0x100117c60, >>>>>>>> 0xffffffff7fffde58, 0x400, 0x100117c60), at >>>>>>>>>> 0xffffffff63153694 >>>>>>>>>> [12] orterun(0x4, 0xffffffff7fffde58, 0x2, 0xffffffff7fffdda0, 0x0, >>>>>>>> 0xffffffff702a0010), at 0x100005078 >>>>>>>>>> [13] main(0x4, 0xffffffff7fffde58, 0xffffffff7fffde80, 0x100117c60, >>>>>>>> 0x100000000, 0xffffffff6a700200), at 0x100003d68 >>>>>>>>>> (dbx) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I get the following output with gdb. >>>>>>>>>> >>>>>>>>>> tyr small_prog 107 /usr/local/gdb-7.6.1_64_gcc/bin/gdb >>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec >>>>>>>>>> GNU gdb (GDB) 7.6.1 >>>>>>>>>> Copyright (C) 2013 Free Software Foundation, Inc. >>>>>>>>>> License GPLv3+: GNU GPL version 3 or later >>>>>>>> <http://gnu.org/licenses/gpl.html> >>>>>>>>>> This is free software: you are free to change and redistribute it. >>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>>> copying" >>>>>>>>>> and "show warranty" for details. >>>>>>>>>> This GDB was configured as "sparc-sun-solaris2.10". >>>>>>>>>> For bug reporting instructions, please see: >>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>... >>>>>>>>>> Reading symbols from >>>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done. 
>>>>>>>>>> (gdb) run -np 1 a.out >>>>>>>>>> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 >>>> a.out >>>>>>>>>> [Thread debugging using libthread_db enabled] >>>>>>>>>> [New Thread 1 (LWP 1)] >>>>>>>>>> [New LWP 2 ] >>>>>>>>>> [tyr:27867] *** Process received signal *** >>>>>>>>>> [tyr:27867] Signal: Bus Error (10) >>>>>>>>>> [tyr:27867] Signal code: Invalid address alignment (1) >>>>>>>>>> [tyr:27867] Failing at address: ffffffff7fffd224 >>>>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>>>>>> acktrace_print+0x2c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfa >>>>>>>> 0 >>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>>>>>> e8 [ Signal 10 (BUS)] >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>>>>>> b_base_store+0xc8 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>>> til_decode_pidmap+0x798 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>>> til_nidmap_init+0x3cc >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>>>>>> 6c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>>>>>> nit+0x308 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>>>>>> it+0x31c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>>>>>> x2a8 >>>>>>>> >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20 >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c >>>>>>>>>> [tyr:27867] *** End of error message *** >>>>>>>>>> >>>> -------------------------------------------------------------------------- >>>>>>>>>> mpiexec noticed that process rank 0 with PID 27867 on node tyr >>>> exited on >>>>>>>> signal 10 (Bus Error). 
>>>> -------------------------------------------------------------------------- >>>>>>>>>> [LWP 2 exited] >>>>>>>>>> [New Thread 2 ] >>>>>>>>>> [Switching to Thread 1 (LWP 1)] >>>>>>>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be >>>> found to >>>>>>>> satisfy query >>>>>>>>>> (gdb) bt >>>>>>>>>> #0 0xffffffff7f6173d0 in rtld_db_dlactivity () from >>>>>>>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #1 0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #2 0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #3 0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #4 0xffffffff7f624574 in remove_hdl () from >>>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #5 0xffffffff7f61d97c in dlclose_core () from >>>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #6 0xffffffff7f61d9d4 in dlclose_intn () from >>>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #7 0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>>> #8 0xffffffff7ec7746c in vm_close () >>>>>>>>>> from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>>>>>> #9 0xffffffff7ec74a4c in lt_dlclose () >>>>>>>>>> from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>>>>>> #10 0xffffffff7ec99b70 in ri_destructor (obj=0x1001ead30) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:391 >>>>>>>>>> #11 0xffffffff7ec98488 in opal_obj_run_destructors >>>> (object=0x1001ead30) >>>>>>>>>> at ../../../../openmpi-1.8.2rc3/opal/class/opal_object.h:446 >>>>>>>>>> #12 0xffffffff7ec993ec in mca_base_component_repository_release ( >>>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:244 >>>>>>>>>> #13 0xffffffff7ec9b734 in mca_base_component_unload ( >>>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, >>>> output_id=-1) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:47 >>>>>>>>>> #14 0xffffffff7ec9b7c8 in mca_base_component_close ( >>>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, >>>> output_id=-1) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:60 >>>>>>>>>> #15 0xffffffff7ec9b89c in mca_base_components_close (output_id=-1, >>>>>>>>>> components=0xffffffff7f12b430 <orte_oob_base_framework+80>, >>>> skip=0x0) >>>>>>>>>> ---Type <return> to continue, or q <return> to quit--- >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:86 >>>>>>>>>> #16 0xffffffff7ec9b804 in mca_base_framework_components_close ( >>>>>>>>>> framework=0xffffffff7f12b3e0 <orte_oob_base_framework>, skip=0x0) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:66 >>>>>>>>>> #17 0xffffffff7efae1e4 in orte_oob_base_close () >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/orte/mca/oob/base/oob_base_frame.c:94 >>>>>>>>>> #18 0xffffffff7ecb28ac in mca_base_framework_close ( >>>>>>>>>> framework=0xffffffff7f12b3e0 <orte_oob_base_framework>) >>>>>>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_framework.c:187 >>>>>>>>>> #19 0xffffffff7bf078c0 in rte_finalize () >>>>>>>>>> at >>>> ../../../../../openmpi-1.8.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:858 >>>>>>>>>> #20 0xffffffff7ef30a44 in orte_finalize () >>>>>>>>>> at ../../openmpi-1.8.2rc3/orte/runtime/orte_finalize.c:65 >>>>>>>>>> #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0e8) >>>>>>>>>> 
at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/orterun.c:1096 >>>>>>>>>> #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0e8) >>>>>>>>>> at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/main.c:13 >>>>>>>>>> (gdb) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Is the above information helpful to track down the error? Do you >>>> need >>>>>>>>>> anything else? Thank you very much for any help in advance. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Kind regards >>>>>>>>>> >>>>>>>>>> Siegmar >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Jul 25, 2014, at 2:08 AM, Siegmar Gross >>>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris >>>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program. >>>>>>>>>>>> >>>>>>>>>>>> tyr hello_1 105 mpiexec -np 2 a.out >>>>>>>>>>>> [tyr:29164] *** Process received signal *** >>>>>>>>>>>> [tyr:29164] Signal: Bus Error (10) >>>>>>>>>>>> [tyr:29164] Signal code: Invalid address alignment (1) >>>>>>>>>>>> [tyr:29164] Failing at address: ffffffff7fffd1c4 >>>>>>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>>>>>> acktrace_print+0x2c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd >>>>>>>> 0 >>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>>>>>> e8 [ Signal 10 (BUS)] >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>>>>>> b_base_store+0xc8 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>>> til_decode_pidmap+0x798 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>>> til_nidmap_init+0x3cc >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>>>>>> 6c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>>>>>> nit+0x308 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>>>>>> it+0x31c >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>>>>>> x2a8 >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20 >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c >>>>>>>>>>>> [tyr:29164] *** End of error message *** >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I get the following output if I run the program in "dbx". >>>>>>>>>>>> >>>>>>>>>>>> ... >>>>>>>>>>>> RTC: Enabling Error Checking... >>>>>>>>>>>> RTC: Running program... >>>>>>>>>>>> Write to unallocated (wua) on thread 1: >>>>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000 >>>>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0 >>>>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064: call >>>>>>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80 >>>>>>>>>>>> (dbx) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hopefully the above output helps to fix the error. Can I provide >>>>>>>>>>>> anything else? Thank you very much for any help in advance. 
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards
>>>>>>>>>>>>
>>>>>>>>>>>> Siegmar