Sorry to chime in a little late. George is likely correct about using ORTE_NAME, except that you can't do that, because the OPAL layer has no idea what that datatype looks like. This was the original reason for creating the opal_identifier_t type - I had no other choice when we moved the db framework (now dstore) to the OPAL layer in anticipation of the BTLs moving to OPAL. The abstraction requirement wouldn't allow me to pass down the structure definition.
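
To make the type mismatch concrete, here is a minimal sketch. It is not taken from the OPAL or ORTE headers: name_sketch_t, id_sketch_t and alignment_gap are made-up names, the struct layout follows the orte_process_name_t definition quoted further down in this thread, and I am assuming the opaque identifier is a plain uint64_t (the backtrace below shows it as an unsigned long *).

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two-field process name. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} name_sketch_t;

/* Assumption: the opaque identifier is a single 64-bit integer. */
typedef uint64_t id_sketch_t;

/* Both types have the same size, which is why the cast looks harmless. */
static_assert(sizeof(name_sketch_t) == sizeof(id_sketch_t),
              "name and identifier must have the same size");

/*
 * Difference between the alignment the identifier requires and the
 * alignment the struct guarantees. A nonzero result means a
 * name_sketch_t may sit at an address that is misaligned for
 * id_sketch_t.
 */
size_t alignment_gap(void)
{
    return alignof(id_sketch_t) - alignof(name_sketch_t);
}
```

On ABIs where uint64_t requires 8-byte alignment, alignment_gap() is nonzero, so a name placed on the stack is only guaranteed 4-byte alignment and a direct 64-bit load through a cast pointer can trap on strict-alignment hardware.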
The easiest solution is probably to change the opal/db/hash code so that 64-bit fields are memcpy'd instead of simply passed by "=". This should eliminate the problem with the least fuss.

There is a performance penalty for using non-aligned data, and ideally we should use aligned data whenever possible. This code isn't in the critical path and so this is less of an issue, but still would be nice to do. However, I didn't do so for the following reasons:

* I couldn't find a way for the compiler to check/require alignment down in opal_db.store when passed a parameter. If someone knows of a way to do that, please feel free to suggest it

* none of our current developers have access to a Solaris SPARC machine, and thus our developers cannot detect violations when they occur

* the current solution avoids the issue, albeit with a slight performance penalty

I'm open to alternative methods - I'm not happy with the ugliness this required, but couldn't come up with a cleaner solution that would be easy for developers to know when they violated the alignment requirement.

FWIW: it is possible, I suppose, that the other discussion about using an opal_process_name_t that exactly mirrors orte_process_name_t could also resolve this problem in a cleaner fashion. I didn't impose that requirement here, but maybe it's another motivator for doing so?

Ralph

On Aug 7, 2014, at 11:46 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> George,
>
> (one of the) faulty line was :
>
> if (ORTE_SUCCESS != (rc =
> opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL,
> OPAL_DB_LOCALLDR,
> (opal_identifier_t*)&proc, OPAL_ID_T))) {
>
> so if proc is not 64 bits aligned, a SIGBUS will occur on sparc.
> as you pointed, replacing OPAL_ID_T with ORTE_NAME will very likely fix the
> issue (i have no arch to test...)
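
For reference, the memcpy approach mentioned above can be sketched as follows. This is only a sketch under stated assumptions, not the db_hash.c code: load_u64_unaligned is a hypothetical helper name, and it assumes the value arrives as an untyped pointer with no alignment guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a 64-bit value from storage that may not be
 * 8-byte aligned. memcpy copies bytes and places no alignment
 * requirement on its source, unlike *(uint64_t *)data.
 */
uint64_t load_u64_unaligned(const void *data)
{
    uint64_t value;

    assert(data != NULL);
    memcpy(&value, data, sizeof value);
    return value;
}
```

A store routine could then assign the returned value to its own aligned field instead of dereferencing the caller's pointer directly. The cost is a byte-wise copy on strict-alignment targets, which matches the slight performance penalty described above.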
>
> i was initially also "confused" with the following line
>
> if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc,
> OPAL_SCOPE_INTERNAL,
> ORTE_DB_NPROC_OFFSET,
> &offset, OPAL_UINT32))) {
>
> the first argument of store is an (opal_identifier_t *)
> strictly speaking this is "a pointer to a 64 bits aligned address", and proc
> might not be 64 bits aligned.
> /* that being said, there is no crash :-) */
>
> in this case, opal_db.store pointer points to the store function
> (db_hash.c:178)
> and proc is only used in memcpy at line 194, so 64 bits alignment is not
> required.
> (and comment is explicit : /* to protect alignment, copy the data across */)
>
> that might sound pedantic, but are we doing the right thing here ?
> (e.g. cast to (opal_identifier_t *), followed by a memcpy in case the
> pointer was not 64 bits aligned
> vs always use aligned data ?)
>
> Cheers,
>
> Gilles
>
> On 2014/08/08 14:58, George Bosilca wrote:
>> This is a gigantic patch for an almost trivial issue. The current problem
>> is purely related to the fact that in a single location (nidmap.c) the
>> orte_process_name_t (which is a structure of 2 integers) is supposed to be
>> aligned based on the uint64_t requirements. Bad assumption!
>>
>> Looking at the code one might notice that the orte_process_name_t is stored
>> using a particular DSS type OPAL_ID_T. This is a shortcut that doesn't hold
>> on the SPARC architecture because the two types (int32_t and int64_t) have
>> different alignments. However, ORTE defines a type for orte_process_name_t.
>> Thus, I think that if instead of saving the orte_process_name_t as an
>> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>>
>> George.
>>
>>
>>
>> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>> Kawashima-san and all,
>>>
>>> Here is attached a one off patch for v1.8.
>>> /* it does not use the __attribute__ modifier that might not be
>>> supported by all compilers */
>>>
>>> as far as i am concerned, the same issue is also in the trunk,
>>> and if you do not hit it, it just means you are lucky :-)
>>>
>>> the same issue might also be in other parts of the code :-(
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/08/08 13:45, Kawashima, Takahiro wrote:
>>>> Gilles, George,
>>>>
>>>> The problem is the one Gilles pointed out.
>>>> I temporarily modified the code as shown below and the bus error disappeared.
>>>>
>>>> --- orte/util/nidmap.c (revision 32447)
>>>> +++ orte/util/nidmap.c (working copy)
>>>> @@ -885,7 +885,7 @@
>>>>      orte_proc_state_t state;
>>>>      orte_app_idx_t app_idx;
>>>>      int32_t restarts;
>>>> -    orte_process_name_t proc, dmn;
>>>> +    orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>>>>      char *hostname;
>>>>      uint8_t flag;
>>>>      opal_buffer_t *bptr;
>>>>
>>>> Takahiro Kawashima,
>>>> MPI development team,
>>>> Fujitsu
>>>>
>>>>> Kawashima-san,
>>>>>
>>>>> This is interesting :-)
>>>>>
>>>>> proc is in the stack and has type orte_process_name_t
>>>>>
>>>>> with
>>>>>
>>>>> typedef uint32_t orte_jobid_t;
>>>>> typedef uint32_t orte_vpid_t;
>>>>> struct orte_process_name_t {
>>>>>     orte_jobid_t jobid; /**< Job number */
>>>>>     orte_vpid_t vpid; /**< Process id - equivalent to rank */
>>>>> };
>>>>> typedef struct orte_process_name_t orte_process_name_t;
>>>>>
>>>>>
>>>>> so there is really no reason to align this on 8 bytes...
>>>>> but later, proc is cast to a uint64_t ...
>>>>> so proc should have been aligned on 8 bytes but it is too late,
>>>>> and hence the glory SIGBUS
>>>>>
>>>>>
>>>>> this is loosely related to
>>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>>>>> (see heterogeneous.v2.patch)
>>>>> if we make opal_process_name_t a union of uint64_t and a struct of two
>>>>> uint32_t, the compiler
>>>>> will align this on 8 bytes.
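
The union idea described just above can be sketched like this. It is an illustration only, not the contents of heterogeneous.v2.patch: process_name_sketch_t and name_as_id are hypothetical names, and the field names simply follow the orte_process_name_t definition quoted above.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof (C11) */
#include <stdint.h>

/*
 * Hypothetical union: the uint64_t member forces the whole object to
 * the alignment of uint64_t, while the struct member keeps the
 * two-field view of the name.
 */
typedef union {
    uint64_t id;
    struct {
        uint32_t jobid;
        uint32_t vpid;
    } name;
} process_name_sketch_t;

/* The compiler now guarantees the stricter alignment for every variable. */
static_assert(alignof(process_name_sketch_t) == alignof(uint64_t),
              "union must be aligned like uint64_t");

/* Reading the 64-bit view is safe because the object is suitably aligned. */
uint64_t name_as_id(const process_name_sketch_t *p)
{
    return p->id;
}
```

With a type like this, any variable declared on the stack gets the stricter alignment automatically, so no cast to a 64-bit pointer is needed. The 64-bit value that results from a given jobid/vpid pair depends on byte order.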
>>>>> note the patch is not enough (and will not apply on the v1.8 branch anyway),
>>>>> we could simply remove orte_process_name_t and ompi_process_name_t and
>>>>> use only
>>>>> opal_process_name_t (and never declare variables with type
>>>>> opal_proc_name_t otherwise alignment might be incorrect)
>>>>>
>>>>> as a workaround, you can declare an opal_process_name_t (for alignment),
>>>>> and cast it to an orte_process_name_t
>>>>>
>>>>> i will write a patch (i will not be able to test on sparc ...)
>>>>> please note this issue might be present in other places
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>>>>> Hi,
>>>>>>
>>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
>>>>>> I've finally reproduced the bus error in my SPARC environment.
>>>>>>
>>>>>> #0 0xffffffff00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xfffff80100064af0,0x35b4)
>>>>>> #1 0xffffffff0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x000007feffffd100,p=(void *) 0x000007feffffd100) at line 277 in ../sigattach.c <SIGNAL HANDLER>
>>>>>> #2 0xffffffff0282aff4 (store + 0x540) (uid=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",data=(void *) 0x000007feffffde74,type=15:'\017') at line 252 in db_hash.c
>>>>>> #3 0xffffffff01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",object=(void *) 0x000007feffffde74,type=15:'\017') at line 49 in db_base_fns.c
>>>>>> #4 0xffffffff00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x0000000000281d70) at line 975 in nidmap.c
>>>>>> #5 0xffffffff00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x0000000000241fc0) at line 141 in nidmap.c
>>>>>> #6 0xffffffff01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
>>>>>> #7 0xffffffff00f9f28c (orte_init + 0x308) (pargc=(int *) 0x0000000000000000,pargv=(char ***) 0x0000000000000000,flags=32) at line 148 in orte_init.c
>>>>>> #8 0xffffffff001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x000007fefffff348,requested=0,provided=(int *) 0x000007feffffe698) at line 464 in ompi_mpi_init.c
>>>>>> #9 0xffffffff001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x000007feffffe814,argv=(char ***) 0x000007feffffe818) at line 84 in init.c
>>>>>> #10 0x0000000000100ae4 (main + 0x44) (argc=1,argv=(char **) 0x000007fefffff348) at line 8 in mpiinitfinalize.c
>>>>>> #11 0xffffffff00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fefffff348,0x100d24,0x100d14,0x0)
>>>>>> #12 0x000000000010094c (_start + 0x2c) ()
>>>>>>
>>>>>> The line 252 in opal/mca/db/hash/db_hash.c is:
>>>>>>
>>>>>> case OPAL_UINT64:
>>>>>>     if (NULL == data) {
>>>>>>         OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>>>>>         return OPAL_ERR_BAD_PARAM;
>>>>>>     }
>>>>>>     kv->type = OPAL_UINT64;
>>>>>>     kv->data.uint64 = *(uint64_t*)(data); // !!! here !!!
>>>>>>     break;
>>>>>>
>>>>>> My environment is:
>>>>>>
>>>>>> Open MPI v1.8 branch r32447 (latest)
>>>>>> configure --enable-debug
>>>>>> SPARC-V9 (Fujitsu SPARC64 IXfx)
>>>>>> Linux (custom)
>>>>>> gcc 4.2.4
>>>>>>
>>>>>> I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
>>>>>>
>>>>>> Can this information help?
>>>>>>
>>>>>> Takahiro Kawashima,
>>>>>> MPI development team,
>>>>>> Fujitsu
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm sorry once more to answer late, but the last two days our mail
>>>>>>> server was down (hardware error).
>>>>>>>
>>>>>>>> Did you configure this --enable-debug?
>>>>>>> Yes, I used the following command.
>>>>>>>
>>>>>>> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>>>>>> --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>>>>>> --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>>>> --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>>>> JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>>>> LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>>>>>> CC="gcc" CXX="g++" FC="gfortran" \
>>>>>>> CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>>>> CPP="cpp" CXXCPP="cpp" \
>>>>>>> CPPFLAGS="" CXXCPPFLAGS="" \
>>>>>>> --enable-mpi-cxx \
>>>>>>> --enable-cxx-exceptions \
>>>>>>> --enable-mpi-java \
>>>>>>> --enable-heterogeneous \
>>>>>>> --enable-mpi-thread-multiple \
>>>>>>> --with-threads=posix \
>>>>>>> --with-hwloc=internal \
>>>>>>> --without-verbs \
>>>>>>> --with-wrapper-cflags="-std=c11 -m64" \
>>>>>>> --enable-debug \
>>>>>>> |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> If so, you should get a line number in the backtrace
>>>>>>> I got them for gdb (see below), but not for "dbx".
>>>>>>>
>>>>>>>
>>>>>>> Kind regards
>>>>>>>
>>>>>>> Siegmar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Aug 5, 2014, at 2:59 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm sorry to answer so late, but last week I didn't have Internet
>>>>>>>>> access. In the meantime I've installed openmpi-1.8.2rc3 and I get
>>>>>>>>> the same error.
>>>>>>>>>
>>>>>>>>>> This looks like the typical type of alignment error that we used
>>>>>>>>>> to see when testing regularly on SPARC. :-\
>>>>>>>>>>
>>>>>>>>>> It looks like the error was happening in mca_db_hash.so. Could
>>>>>>>>>> you get a stack trace / file+line number where it was failing
>>>>>>>>>> in mca_db_hash? (i.e., the actual bad code will likely be under
>>>>>>>>>> opal/mca/db/hash somewhere)
>>>>>>>>> Unfortunately I don't get a file+line number from a file in
>>>>>>>>> opal/mca/db/Hash.
>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> tyr small_prog 102 ompi_info | grep MPI: >>>>>>>>> Open MPI: 1.8.2rc3 >>>>>>>>> tyr small_prog 103 which mpicc >>>>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc >>>>>>>>> tyr small_prog 104 mpicc init_finalize.c >>>>>>>>> tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx >>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec >>>>>>>>> For information about new features see `help changes' >>>>>>>>> To remove this message, put `dbxenv suppress_startup_message 7.9' >>> in your >>>>>>> .dbxrc >>>>>>>>> Reading mpiexec >>>>>>>>> Reading ld.so.1 >>>>>>>>> Reading libopen-rte.so.7.0.4 >>>>>>>>> Reading libopen-pal.so.6.2.0 >>>>>>>>> Reading libsendfile.so.1 >>>>>>>>> Reading libpicl.so.1 >>>>>>>>> Reading libkstat.so.1 >>>>>>>>> Reading liblgrp.so.1 >>>>>>>>> Reading libsocket.so.1 >>>>>>>>> Reading libnsl.so.1 >>>>>>>>> Reading libgcc_s.so.1 >>>>>>>>> Reading librt.so.1 >>>>>>>>> Reading libm.so.2 >>>>>>>>> Reading libpthread.so.1 >>>>>>>>> Reading libc.so.1 >>>>>>>>> Reading libdoor.so.1 >>>>>>>>> Reading libaio.so.1 >>>>>>>>> Reading libmd.so.1 >>>>>>>>> (dbx) check -all >>>>>>>>> access checking - ON >>>>>>>>> memuse checking - ON >>>>>>>>> (dbx) run -np 1 a.outRunning: mpiexec -np 1 a.out >>>>>>>>> (process id 27833) >>>>>>>>> Reading rtcapihook.so >>>>>>>>> Reading libdl.so.1 >>>>>>>>> Reading rtcaudit.so >>>>>>>>> Reading libmapmalloc.so.1 >>>>>>>>> Reading libgen.so.1 >>>>>>>>> Reading libc_psr.so.1 >>>>>>>>> Reading rtcboot.so >>>>>>>>> Reading librtc.so >>>>>>>>> Reading libmd_psr.so.1 >>>>>>>>> RTC: Enabling Error Checking... >>>>>>>>> RTC: Running program... >>>>>>>>> Write to unallocated (wua) on thread 1: >>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000 >>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0 >>>>>>>>> 0xffffffff55174da0: _readdir+0x0064: call >>>>>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 
0xffffffff55342a80 >>>>>>>>> (dbx) where >>>>>>>>> current thread: t@1 >>>>>>>>> =>[1] _readdir(0xffffffff79f00300, 0x2e6800, 0x4, 0x2d, 0x4, >>>>>>> 0xffffffff79f00300), at 0xffffffff55174da0 >>>>>>>>> [2] list_files_by_dir(0x100138fd8, 0xffffffff7fffd1f0, >>> 0xffffffff7fffd1e8, >>>>>>> 0xffffffff7fffd210, 0x0, 0xffffffff702a0010), at >>>>>>>>> 0xffffffff63174594 >>>>>>>>> [3] foreachfile_callback(0x100138fd8, 0xffffffff7fffd458, 0x0, >>> 0x2e, 0x0, >>>>>>> 0xffffffff702a0010), at 0xffffffff6317461c >>>>>>>>> [4] foreach_dirinpath(0x1001d8a28, 0x0, 0xffffffff631745e0, >>>>>>> 0xffffffff7fffd458, 0x0, 0xffffffff702a0010), at 0xffffffff63171684 >>>>>>>>> [5] lt_dlforeachfile(0x1001d8a28, 0xffffffff6319656c, 0x0, 0x53, >>> 0x2f, >>>>>>> 0xf), at 0xffffffff63174748 >>>>>>>>> [6] find_dyn_components(0x0, 0xffffffff6323b570, 0x0, 0x1, >>>>>>> 0xffffffff7fffd6a0, 0xffffffff702a0010), at 0xffffffff63195e38 >>>>>>>>> [7] mca_base_component_find(0x0, 0xffffffff6323b570, >>> 0xffffffff6335e1b0, >>>>>>> 0x0, 0xffffffff7fffd6a0, 0x1), at 0xffffffff631954d8 >>>>>>>>> [8] mca_base_framework_components_register(0xffffffff6335e1c0, >>> 0x0, 0x3e, >>>>>>> 0x0, 0x3b, 0x100800), at 0xffffffff631b1638 >>>>>>>>> [9] mca_base_framework_register(0xffffffff6335e1c0, 0x0, 0x2, >>>>>>> 0xffffffff7fffd8d0, 0x0, 0xffffffff702a0010), at 0xffffffff631b24d4 >>>>>>>>> [10] mca_base_framework_open(0xffffffff6335e1c0, 0x0, 0x2, >>>>>>> 0xffffffff7fffd990, 0x0, 0xffffffff702a0010), at 0xffffffff631b25d0 >>>>>>>>> [11] opal_init(0xffffffff7fffdd70, 0xffffffff7fffdd78, 0x100117c60, >>>>>>> 0xffffffff7fffde58, 0x400, 0x100117c60), at >>>>>>>>> 0xffffffff63153694 >>>>>>>>> [12] orterun(0x4, 0xffffffff7fffde58, 0x2, 0xffffffff7fffdda0, 0x0, >>>>>>> 0xffffffff702a0010), at 0x100005078 >>>>>>>>> [13] main(0x4, 0xffffffff7fffde58, 0xffffffff7fffde80, 0x100117c60, >>>>>>> 0x100000000, 0xffffffff6a700200), at 0x100003d68 >>>>>>>>> (dbx) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I get the following output with 
gdb. >>>>>>>>> >>>>>>>>> tyr small_prog 107 /usr/local/gdb-7.6.1_64_gcc/bin/gdb >>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec >>>>>>>>> GNU gdb (GDB) 7.6.1 >>>>>>>>> Copyright (C) 2013 Free Software Foundation, Inc. >>>>>>>>> License GPLv3+: GNU GPL version 3 or later >>>>>>> <http://gnu.org/licenses/gpl.html> >>>>>>>>> This is free software: you are free to change and redistribute it. >>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>> copying" >>>>>>>>> and "show warranty" for details. >>>>>>>>> This GDB was configured as "sparc-sun-solaris2.10". >>>>>>>>> For bug reporting instructions, please see: >>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>... >>>>>>>>> Reading symbols from >>>>>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done. >>>>>>>>> (gdb) run -np 1 a.out >>>>>>>>> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 >>> a.out >>>>>>>>> [Thread debugging using libthread_db enabled] >>>>>>>>> [New Thread 1 (LWP 1)] >>>>>>>>> [New LWP 2 ] >>>>>>>>> [tyr:27867] *** Process received signal *** >>>>>>>>> [tyr:27867] Signal: Bus Error (10) >>>>>>>>> [tyr:27867] Signal code: Invalid address alignment (1) >>>>>>>>> [tyr:27867] Failing at address: ffffffff7fffd224 >>>>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>>>>> acktrace_print+0x2c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfa >>>>>>> 0 >>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>>>>> e8 [ Signal 10 (BUS)] >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>>>>> b_base_store+0xc8 >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>> til_decode_pidmap+0x798 >>>>>>> >>> 
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>> til_nidmap_init+0x3cc >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>>>>> 6c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>>>>> nit+0x308 >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>>>>> it+0x31c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>>>>> x2a8 >>>>>>> >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20 >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c >>>>>>>>> [tyr:27867] *** End of error message *** >>>>>>>>> >>> -------------------------------------------------------------------------- >>>>>>>>> mpiexec noticed that process rank 0 with PID 27867 on node tyr >>> exited on >>>>>>> signal 10 (Bus Error). >>> -------------------------------------------------------------------------- >>>>>>>>> [LWP 2 exited] >>>>>>>>> [New Thread 2 ] >>>>>>>>> [Switching to Thread 1 (LWP 1)] >>>>>>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be >>> found to >>>>>>> satisfy query >>>>>>>>> (gdb) bt >>>>>>>>> #0 0xffffffff7f6173d0 in rtld_db_dlactivity () from >>>>>>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #1 0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #2 0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #3 0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #4 0xffffffff7f624574 in remove_hdl () from >>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #5 0xffffffff7f61d97c in dlclose_core () from >>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #6 0xffffffff7f61d9d4 in dlclose_intn () from >>> /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #7 0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 >>>>>>>>> #8 0xffffffff7ec7746c in vm_close () >>>>>>>>> from 
/usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>>>>> #9 0xffffffff7ec74a4c in lt_dlclose () >>>>>>>>> from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>>>>> #10 0xffffffff7ec99b70 in ri_destructor (obj=0x1001ead30) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:391 >>>>>>>>> #11 0xffffffff7ec98488 in opal_obj_run_destructors >>> (object=0x1001ead30) >>>>>>>>> at ../../../../openmpi-1.8.2rc3/opal/class/opal_object.h:446 >>>>>>>>> #12 0xffffffff7ec993ec in mca_base_component_repository_release ( >>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:244 >>>>>>>>> #13 0xffffffff7ec9b734 in mca_base_component_unload ( >>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, >>> output_id=-1) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:47 >>>>>>>>> #14 0xffffffff7ec9b7c8 in mca_base_component_close ( >>>>>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, >>> output_id=-1) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:60 >>>>>>>>> #15 0xffffffff7ec9b89c in mca_base_components_close (output_id=-1, >>>>>>>>> components=0xffffffff7f12b430 <orte_oob_base_framework+80>, >>> skip=0x0) >>>>>>>>> ---Type <return> to continue, or q <return> to quit--- >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:86 >>>>>>>>> #16 0xffffffff7ec9b804 in mca_base_framework_components_close ( >>>>>>>>> framework=0xffffffff7f12b3e0 <orte_oob_base_framework>, skip=0x0) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:66 >>>>>>>>> #17 0xffffffff7efae1e4 in orte_oob_base_close () >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/orte/mca/oob/base/oob_base_frame.c:94 >>>>>>>>> #18 0xffffffff7ecb28ac in mca_base_framework_close ( >>>>>>>>> 
framework=0xffffffff7f12b3e0 <orte_oob_base_framework>) >>>>>>>>> at >>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_framework.c:187 >>>>>>>>> #19 0xffffffff7bf078c0 in rte_finalize () >>>>>>>>> at >>> ../../../../../openmpi-1.8.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:858 >>>>>>>>> #20 0xffffffff7ef30a44 in orte_finalize () >>>>>>>>> at ../../openmpi-1.8.2rc3/orte/runtime/orte_finalize.c:65 >>>>>>>>> #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0e8) >>>>>>>>> at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/orterun.c:1096 >>>>>>>>> #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0e8) >>>>>>>>> at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/main.c:13 >>>>>>>>> (gdb) >>>>>>>>> >>>>>>>>> >>>>>>>>> Is the above information helpful to track down the error? Do you >>> need >>>>>>>>> anything else? Thank you very much for any help in advance. >>>>>>>>> >>>>>>>>> >>>>>>>>> Kind regards >>>>>>>>> >>>>>>>>> Siegmar >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Jul 25, 2014, at 2:08 AM, Siegmar Gross >>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris >>>>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program. 
>>>>>>>>>>> >>>>>>>>>>> tyr hello_1 105 mpiexec -np 2 a.out >>>>>>>>>>> [tyr:29164] *** Process received signal *** >>>>>>>>>>> [tyr:29164] Signal: Bus Error (10) >>>>>>>>>>> [tyr:29164] Signal code: Invalid address alignment (1) >>>>>>>>>>> [tyr:29164] Failing at address: ffffffff7fffd1c4 >>>>>>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>>>>> acktrace_print+0x2c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd >>>>>>> 0 >>>>>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>>>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>>>>> e8 [ Signal 10 (BUS)] >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>>>>> b_base_store+0xc8 >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>> til_decode_pidmap+0x798 >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>>>>> til_nidmap_init+0x3cc >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>>>>> 6c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>>>>> nit+0x308 >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>>>>> it+0x31c >>>>>>> >>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>>>>> x2a8 >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20 >>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c >>>>>>>>>>> [tyr:29164] *** End of error message *** >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I get the following output if I run the program in "dbx". >>>>>>>>>>> >>>>>>>>>>> ... >>>>>>>>>>> RTC: Enabling Error Checking... 
>>>>>>>>>>> RTC: Running program...
>>>>>>>>>>> Write to unallocated (wua) on thread 1:
>>>>>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000
>>>>>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0
>>>>>>>>>>> 0xffffffff55174da0: _readdir+0x0064: call _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80
>>>>>>>>>>> (dbx)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hopefully the above output helps to fix the error. Can I provide
>>>>>>>>>>> anything else? Thank you very much for any help in advance.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Kind regards
>>>>>>>>>>>
>>>>>>>>>>> Siegmar
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15546.php
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15547.php
>>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15549.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15550.php