Kawashima-san and all,

Attached is a one-off patch for v1.8. /* it does not use the __attribute__ modifier, which might not be supported by all compilers */
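To make the idea behind the patch concrete, here is a tiny standalone example; the typedefs are simplified stand-ins for the real ORTE/OPAL declarations and the values are arbitrary, so treat it as a sketch only. The trick is to declare the 64-bit opal_identifier_t (which the ABI aligns on 8 bytes) and only access it through an orte_process_name_t pointer:

/* sketch of the workaround used in the attached patch;
 * simplified stand-in types, not the real ORTE/OPAL headers */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t opal_identifier_t;                      /* 8-byte aligned by the ABI */
typedef struct { uint32_t jobid; uint32_t vpid; } orte_process_name_t; /* only 4-byte alignment required */

int main(void)
{
    /* declare the 64-bit identifier so the storage is 8-byte aligned,
     * then manipulate it through an orte_process_name_t pointer */
    opal_identifier_t _proc;
    orte_process_name_t *proc = (orte_process_name_t *)&_proc;

    proc->jobid = 123;   /* arbitrary values, just for the example */
    proc->vpid  = 0;

    /* the db/hash store path effectively does *(uint64_t*)proc;
     * that is now safe because &_proc is 8-byte aligned */
    printf("id = %llu\n", (unsigned long long)*(uint64_t *)proc);
    return 0;
}

This is the same type punning the db/hash store path already performs; the union discussed in the quoted mail below is the cleaner way to express it (see the sketch just before the patch).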
As far as I am concerned, the same issue is also in the trunk, and if you do not hit it, it just means you are lucky :-)
The same issue might also be present in other parts of the code :-(

Cheers,

Gilles

On 2014/08/08 13:45, Kawashima, Takahiro wrote:
> Gilles, George,
>
> The problem is the one Gilles pointed out.
> I temporarily modified the code below and the bus error disappeared.
>
> --- orte/util/nidmap.c (revision 32447)
> +++ orte/util/nidmap.c (working copy)
> @@ -885,7 +885,7 @@
>      orte_proc_state_t state;
>      orte_app_idx_t app_idx;
>      int32_t restarts;
> -    orte_process_name_t proc, dmn;
> +    orte_process_name_t proc __attribute__((__aligned__(8))), dmn;
>      char *hostname;
>      uint8_t flag;
>      opal_buffer_t *bptr;
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
>> Kawashima-san,
>>
>> This is interesting :-)
>>
>> proc is on the stack and has type orte_process_name_t
>>
>> with
>>
>> typedef uint32_t orte_jobid_t;
>> typedef uint32_t orte_vpid_t;
>> struct orte_process_name_t {
>>     orte_jobid_t jobid;   /**< Job number */
>>     orte_vpid_t vpid;     /**< Process id - equivalent to rank */
>> };
>> typedef struct orte_process_name_t orte_process_name_t;
>>
>> so there is really no reason to align this on 8 bytes...
>> but later, proc is cast to a uint64_t ...
>> so proc should have been aligned on 8 bytes, but it is too late,
>> and hence the glorious SIGBUS.
>>
>> This is loosely related to
>> http://www.open-mpi.org/community/lists/devel/2014/08/15532.php
>> (see heterogeneous.v2.patch)
>> If we make opal_process_name_t a union of a uint64_t and a struct of two
>> uint32_t, the compiler will align this on 8 bytes.
>> Note the patch is not enough (and will not apply on the v1.8 branch anyway);
>> we could simply remove orte_process_name_t and ompi_process_name_t and
>> use only opal_process_name_t (and never declare variables with type
>> opal_proc_name_t, otherwise alignment might be incorrect).
>>
>> As a workaround, you can declare an opal_process_name_t (for alignment)
>> and cast it to an orte_process_name_t.
>>
>> I will write a patch (I will not be able to test on sparc ...)
>> Please note this issue might be present in other places.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/08 13:03, Kawashima, Takahiro wrote:
>>> Hi,
>>>
>>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
>>>>>>>> 10 Sparc and I receive a bus error, if I run a small program.
>>> I've finally reproduced the bus error in my SPARC environment.
>>>
>>> #0 0xffffffff00db4740 (__waitpid_nocancel + 0x44) (0x200,0x0,0x0,0xa0,0xfffff80100064af0,0x35b4)
>>> #1 0xffffffff0001a310 (handle_signal + 0x574) (signo=10,info=(struct siginfo *) 0x000007feffffd100,p=(void *) 0x000007feffffd100) at line 277 in ../sigattach.c
>>> <SIGNAL HANDLER>
>>> #2 0xffffffff0282aff4 (store + 0x540) (uid=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",data=(void *) 0x000007feffffde74,type=15:'\017') at line 252 in db_hash.c
>>> #3 0xffffffff01266350 (opal_db_base_store + 0xc4) (proc=(unsigned long *) 0xffffffff0118a128,scope=8:'\b',key=(char *) 0xffffffff0106a0a8 "opal.local.ldr",object=(void *) 0x000007feffffde74,type=15:'\017') at line 49 in db_base_fns.c
>>> #4 0xffffffff00fdbab4 (orte_util_decode_pidmap + 0x790) (bo=(struct *) 0x0000000000281d70) at line 975 in nidmap.c
>>> #5 0xffffffff00fd6d20 (orte_util_nidmap_init + 0x3dc) (buffer=(struct opal_buffer_t *) 0x0000000000241fc0) at line 141 in nidmap.c
>>> #6 0xffffffff01e298cc (rte_init + 0x2a0) () at line 153 in ess_env_module.c
>>> #7 0xffffffff00f9f28c (orte_init + 0x308) (pargc=(int *) 0x0000000000000000,pargv=(char ***) 0x0000000000000000,flags=32) at line 148 in orte_init.c
>>> #8 0xffffffff001a6f08 (ompi_mpi_init + 0x31c) (argc=1,argv=(char **) 0x000007fefffff348,requested=0,provided=(int *) 0x000007feffffe698) at line 464 in ompi_mpi_init.c
>>> #9 0xffffffff001ff79c (MPI_Init + 0x2b0) (argc=(int *) 0x000007feffffe814,argv=(char ***) 0x000007feffffe818) at line 84 in init.c
>>> #10 0x0000000000100ae4 (main + 0x44) (argc=1,argv=(char **) 0x000007fefffff348) at line 8 in mpiinitfinalize.c
>>> #11 0xffffffff00d2b81c (__libc_start_main + 0x194) (0x100aa0,0x1,0x7fefffff348,0x100d24,0x100d14,0x0)
>>> #12 0x000000000010094c (_start + 0x2c) ()
>>>
>>> The line 252 in opal/mca/db/hash/db_hash.c is:
>>>
>>>     case OPAL_UINT64:
>>>         if (NULL == data) {
>>>             OPAL_ERROR_LOG(OPAL_ERR_BAD_PARAM);
>>>             return OPAL_ERR_BAD_PARAM;
>>>         }
>>>         kv->type = OPAL_UINT64;
>>>         kv->data.uint64 = *(uint64_t*)(data);   // !!! here !!!
>>>         break;
>>>
>>> My environment is:
>>>
>>>   Open MPI v1.8 branch r32447 (latest)
>>>   configure --enable-debug
>>>   SPARC-V9 (Fujitsu SPARC64 IXfx)
>>>   Linux (custom)
>>>   gcc 4.2.4
>>>
>>> I could not reproduce it with Open MPI trunk nor with Fujitsu compiler.
>>>
>>> Can this information help?
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
>>>> Hi,
>>>>
>>>> I'm sorry once more to answer late, but the last two days our mail
>>>> server was down (hardware error).
>>>>
>>>>> Did you configure this --enable-debug?
>>>> Yes, I used the following command.
>>>>
>>>> ../openmpi-1.8.2rc3/configure --prefix=/usr/local/openmpi-1.8.2_64_gcc \
>>>>   --libdir=/usr/local/openmpi-1.8.2_64_gcc/lib64 \
>>>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>   LDFLAGS="-m64 -L/usr/local/gcc-4.9.0/lib/amd64" \
>>>>   CC="gcc" CXX="g++" FC="gfortran" \
>>>>   CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>   CPP="cpp" CXXCPP="cpp" \
>>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>>   --enable-mpi-cxx \
>>>>   --enable-cxx-exceptions \
>>>>   --enable-mpi-java \
>>>>   --enable-heterogeneous \
>>>>   --enable-mpi-thread-multiple \
>>>>   --with-threads=posix \
>>>>   --with-hwloc=internal \
>>>>   --without-verbs \
>>>>   --with-wrapper-cflags="-std=c11 -m64" \
>>>>   --enable-debug \
>>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
>>>>
>>>>> If so, you should get a line number in the backtrace
>>>> I got them for gdb (see below), but not for "dbx".
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>>> On Aug 5, 2014, at 2:59 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm sorry to answer so late, but last week I didn't have Internet
>>>>>> access. In the meantime I've installed openmpi-1.8.2rc3 and I get
>>>>>> the same error.
>>>>>>
>>>>>>> This looks like the typical type of alignment error that we used
>>>>>>> to see when testing regularly on SPARC. :-\
>>>>>>>
>>>>>>> It looks like the error was happening in mca_db_hash.so. Could
>>>>>>> you get a stack trace / file+line number where it was failing
>>>>>>> in mca_db_hash? (i.e., the actual bad code will likely be under
>>>>>>> opal/mca/db/hash somewhere)
>>>>>> Unfortunately I don't get a file+line number from a file in
>>>>>> opal/mca/db/Hash.
>>>>>>
>>>>>> tyr small_prog 102 ompi_info | grep MPI:
>>>>>>   Open MPI: 1.8.2rc3
>>>>>> tyr small_prog 103 which mpicc
>>>>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
>>>>>> tyr small_prog 104 mpicc init_finalize.c
>>>>>> tyr small_prog 106 /opt/solstudio12.3/bin/sparcv9/dbx /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
>>>>>> For information about new features see `help changes'
>>>>>> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
>>>>>> Reading mpiexec
>>>>>> Reading ld.so.1
>>>>>> Reading libopen-rte.so.7.0.4
>>>>>> Reading libopen-pal.so.6.2.0
>>>>>> Reading libsendfile.so.1
>>>>>> Reading libpicl.so.1
>>>>>> Reading libkstat.so.1
>>>>>> Reading liblgrp.so.1
>>>>>> Reading libsocket.so.1
>>>>>> Reading libnsl.so.1
>>>>>> Reading libgcc_s.so.1
>>>>>> Reading librt.so.1
>>>>>> Reading libm.so.2
>>>>>> Reading libpthread.so.1
>>>>>> Reading libc.so.1
>>>>>> Reading libdoor.so.1
>>>>>> Reading libaio.so.1
>>>>>> Reading libmd.so.1
>>>>>> (dbx) check -all
>>>>>> access checking - ON
>>>>>> memuse checking - ON
>>>>>> (dbx) run -np 1 a.out
>>>>>> Running: mpiexec -np 1 a.out
>>>>>> (process id 27833)
>>>>>> Reading rtcapihook.so
>>>>>> Reading libdl.so.1
>>>>>> Reading rtcaudit.so
>>>>>> Reading libmapmalloc.so.1
>>>>>> Reading libgen.so.1
>>>>>> Reading libc_psr.so.1
>>>>>> Reading rtcboot.so
>>>>>> Reading librtc.so
>>>>>> Reading libmd_psr.so.1
>>>>>> RTC: Enabling Error Checking...
>>>>>> RTC: Running program...
>>>>>> Write to unallocated (wua) on thread 1: >>>>>> Attempting to write 1 byte at address 0xffffffff79f04000 >>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0 >>>>>> 0xffffffff55174da0: _readdir+0x0064: call >>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80 >>>>>> (dbx) where >>>>>> current thread: t@1 >>>>>> =>[1] _readdir(0xffffffff79f00300, 0x2e6800, 0x4, 0x2d, 0x4, >>>> 0xffffffff79f00300), at 0xffffffff55174da0 >>>>>> [2] list_files_by_dir(0x100138fd8, 0xffffffff7fffd1f0, >>>>>> 0xffffffff7fffd1e8, >>>> 0xffffffff7fffd210, 0x0, 0xffffffff702a0010), at >>>>>> 0xffffffff63174594 >>>>>> [3] foreachfile_callback(0x100138fd8, 0xffffffff7fffd458, 0x0, 0x2e, >>>>>> 0x0, >>>> 0xffffffff702a0010), at 0xffffffff6317461c >>>>>> [4] foreach_dirinpath(0x1001d8a28, 0x0, 0xffffffff631745e0, >>>> 0xffffffff7fffd458, 0x0, 0xffffffff702a0010), at 0xffffffff63171684 >>>>>> [5] lt_dlforeachfile(0x1001d8a28, 0xffffffff6319656c, 0x0, 0x53, 0x2f, >>>> 0xf), at 0xffffffff63174748 >>>>>> [6] find_dyn_components(0x0, 0xffffffff6323b570, 0x0, 0x1, >>>> 0xffffffff7fffd6a0, 0xffffffff702a0010), at 0xffffffff63195e38 >>>>>> [7] mca_base_component_find(0x0, 0xffffffff6323b570, >>>>>> 0xffffffff6335e1b0, >>>> 0x0, 0xffffffff7fffd6a0, 0x1), at 0xffffffff631954d8 >>>>>> [8] mca_base_framework_components_register(0xffffffff6335e1c0, 0x0, >>>>>> 0x3e, >>>> 0x0, 0x3b, 0x100800), at 0xffffffff631b1638 >>>>>> [9] mca_base_framework_register(0xffffffff6335e1c0, 0x0, 0x2, >>>> 0xffffffff7fffd8d0, 0x0, 0xffffffff702a0010), at 0xffffffff631b24d4 >>>>>> [10] mca_base_framework_open(0xffffffff6335e1c0, 0x0, 0x2, >>>> 0xffffffff7fffd990, 0x0, 0xffffffff702a0010), at 0xffffffff631b25d0 >>>>>> [11] opal_init(0xffffffff7fffdd70, 0xffffffff7fffdd78, 0x100117c60, >>>> 0xffffffff7fffde58, 0x400, 0x100117c60), at >>>>>> 0xffffffff63153694 >>>>>> [12] orterun(0x4, 0xffffffff7fffde58, 0x2, 0xffffffff7fffdda0, 0x0, >>>> 0xffffffff702a0010), at 0x100005078 >>>>>> [13] main(0x4, 0xffffffff7fffde58, 0xffffffff7fffde80, 0x100117c60, >>>> 0x100000000, 0xffffffff6a700200), at 0x100003d68 >>>>>> (dbx) >>>>>> >>>>>> >>>>>> >>>>>> I get the following output with gdb. >>>>>> >>>>>> tyr small_prog 107 /usr/local/gdb-7.6.1_64_gcc/bin/gdb >>>> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec >>>>>> GNU gdb (GDB) 7.6.1 >>>>>> Copyright (C) 2013 Free Software Foundation, Inc. >>>>>> License GPLv3+: GNU GPL version 3 or later >>>> <http://gnu.org/licenses/gpl.html> >>>>>> This is free software: you are free to change and redistribute it. >>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>>>>> copying" >>>>>> and "show warranty" for details. >>>>>> This GDB was configured as "sparc-sun-solaris2.10". >>>>>> For bug reporting instructions, please see: >>>>>> <http://www.gnu.org/software/gdb/bugs/>... >>>>>> Reading symbols from >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done. 
>>>>>> (gdb) run -np 1 a.out >>>>>> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 a.out >>>>>> [Thread debugging using libthread_db enabled] >>>>>> [New Thread 1 (LWP 1)] >>>>>> [New LWP 2 ] >>>>>> [tyr:27867] *** Process received signal *** >>>>>> [tyr:27867] Signal: Bus Error (10) >>>>>> [tyr:27867] Signal code: Invalid address alignment (1) >>>>>> [tyr:27867] Failing at address: ffffffff7fffd224 >>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>> acktrace_print+0x2c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfa >>>> 0 >>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>> e8 [ Signal 10 (BUS)] >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>> b_base_store+0xc8 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>> til_decode_pidmap+0x798 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>> til_nidmap_init+0x3cc >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>> 6c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>> nit+0x308 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>> it+0x31c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>> x2a8 >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20 >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c >>>>>> [tyr:27867] *** End of error message *** >>>>>> -------------------------------------------------------------------------- >>>>>> mpiexec noticed that process rank 0 with PID 27867 on node tyr exited on >>>> signal 10 (Bus Error). 
>>>>>> -------------------------------------------------------------------------- >>>>>> [LWP 2 exited] >>>>>> [New Thread 2 ] >>>>>> [Switching to Thread 1 (LWP 1)] >>>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to >>>> satisfy query >>>>>> (gdb) bt >>>>>> #0 0xffffffff7f6173d0 in rtld_db_dlactivity () from >>>> /usr/lib/sparcv9/ld.so.1 >>>>>> #1 0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1 >>>>>> #2 0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1 >>>>>> #3 0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1 >>>>>> #4 0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1 >>>>>> #5 0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1 >>>>>> #6 0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1 >>>>>> #7 0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1 >>>>>> #8 0xffffffff7ec7746c in vm_close () >>>>>> from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>> #9 0xffffffff7ec74a4c in lt_dlclose () >>>>>> from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6 >>>>>> #10 0xffffffff7ec99b70 in ri_destructor (obj=0x1001ead30) >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:391 >>>>>> #11 0xffffffff7ec98488 in opal_obj_run_destructors (object=0x1001ead30) >>>>>> at ../../../../openmpi-1.8.2rc3/opal/class/opal_object.h:446 >>>>>> #12 0xffffffff7ec993ec in mca_base_component_repository_release ( >>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>) >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_component_repository.c:244 >>>>>> #13 0xffffffff7ec9b734 in mca_base_component_unload ( >>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, output_id=-1) >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:47 >>>>>> #14 0xffffffff7ec9b7c8 in mca_base_component_close ( >>>>>> component=0xffffffff7b023cf0 <mca_oob_tcp_component>, output_id=-1) >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:60 >>>>>> #15 0xffffffff7ec9b89c in mca_base_components_close (output_id=-1, >>>>>> components=0xffffffff7f12b430 <orte_oob_base_framework+80>, skip=0x0) >>>>>> ---Type <return> to continue, or q <return> to quit--- >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:86 >>>>>> #16 0xffffffff7ec9b804 in mca_base_framework_components_close ( >>>>>> framework=0xffffffff7f12b3e0 <orte_oob_base_framework>, skip=0x0) >>>>>> at >>>> ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_components_close.c:66 >>>>>> #17 0xffffffff7efae1e4 in orte_oob_base_close () >>>>>> at ../../../../openmpi-1.8.2rc3/orte/mca/oob/base/oob_base_frame.c:94 >>>>>> #18 0xffffffff7ecb28ac in mca_base_framework_close ( >>>>>> framework=0xffffffff7f12b3e0 <orte_oob_base_framework>) >>>>>> at ../../../../openmpi-1.8.2rc3/opal/mca/base/mca_base_framework.c:187 >>>>>> #19 0xffffffff7bf078c0 in rte_finalize () >>>>>> at >>>>>> ../../../../../openmpi-1.8.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:858 >>>>>> #20 0xffffffff7ef30a44 in orte_finalize () >>>>>> at ../../openmpi-1.8.2rc3/orte/runtime/orte_finalize.c:65 >>>>>> #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0e8) >>>>>> at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/orterun.c:1096 >>>>>> #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0e8) >>>>>> at ../../../../openmpi-1.8.2rc3/orte/tools/orterun/main.c:13 >>>>>> (gdb) >>>>>> >>>>>> >>>>>> Is the 
above information helpful to track down the error? Do you need >>>>>> anything else? Thank you very much for any help in advance. >>>>>> >>>>>> >>>>>> Kind regards >>>>>> >>>>>> Siegmar >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> On Jul 25, 2014, at 2:08 AM, Siegmar Gross >>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris >>>>>>>> 10 Sparc and I receive a bus error, if I run a small program. >>>>>>>> >>>>>>>> tyr hello_1 105 mpiexec -np 2 a.out >>>>>>>> [tyr:29164] *** Process received signal *** >>>>>>>> [tyr:29164] Signal: Bus Error (10) >>>>>>>> [tyr:29164] Signal code: Invalid address alignment (1) >>>>>>>> [tyr:29164] Failing at address: ffffffff7fffd1c4 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_b >>>> acktrace_print+0x2c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd >>>> 0 >>>>>>>> /lib/sparcv9/libc.so.1:0xd8b98 >>>>>>>> /lib/sparcv9/libc.so.1:0xcc70c >>>>>>>> /lib/sparcv9/libc.so.1:0xcc918 >>>>>>>> >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3e >>>> e8 [ Signal 10 (BUS)] >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_d >>>> b_base_store+0xc8 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>> til_decode_pidmap+0x798 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_u >>>> til_nidmap_init+0x3cc >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x22 >>>> 6c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_i >>>> nit+0x308 >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_in >>>> it+0x31c >>>> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0 >>>> x2a8 >>>>>>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20 >>>>>>>> >>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c >>>>>>>> [tyr:29164] *** End of error message *** >>>>>>>> ... >>>>>>>> >>>>>>>> >>>>>>>> I get the following output if I run the program in "dbx". >>>>>>>> >>>>>>>> ... >>>>>>>> RTC: Enabling Error Checking... >>>>>>>> RTC: Running program... >>>>>>>> Write to unallocated (wua) on thread 1: >>>>>>>> Attempting to write 1 byte at address 0xffffffff79f04000 >>>>>>>> t@1 (l@1) stopped in _readdir at 0xffffffff55174da0 >>>>>>>> 0xffffffff55174da0: _readdir+0x0064: call >>>> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff55342a80 >>>>>>>> (dbx) >>>>>>>> >>>>>>>> >>>>>>>> Hopefully the above output helps to fix the error. Can I provide >>>>>>>> anything else? Thank you very much for any help in advance. >>>>>>>> >>>>>>>> >>>>>>>> Kind regards >>>>>>>> >>>>>>>> Siegmar > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15546.php
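The patch for v1.8 follows. For completeness, here is a rough sketch of the union approach mentioned in the quoted discussion; the type below is a simplified stand-in, not the actual OPAL header, and the values are arbitrary:

/* sketch only: simplified stand-in for the real OPAL type */
#include <stdint.h>
#include <stdio.h>

typedef union {
    uint64_t id;                 /* forces 8-byte alignment of the whole object */
    struct {
        uint32_t jobid;
        uint32_t vpid;
    } name;
} opal_process_name_t;

int main(void)
{
    opal_process_name_t p;       /* 8-byte aligned by construction */
    p.name.jobid = 123;          /* arbitrary values for the example */
    p.name.vpid  = 0;
    /* reading the other union member yields the 64-bit key;
     * no misaligned load is possible here */
    printf("key = %llu\n", (unsigned long long)p.id);
    return 0;
}

Because the union carries the alignment of its uint64_t member, a variable of this type can always be handed to code that reinterprets it as a 64-bit identifier without risking a SIGBUS on SPARC.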
Index: orte/util/nidmap.c =================================================================== --- orte/util/nidmap.c (revision 32449) +++ orte/util/nidmap.c (working copy) @@ -13,7 +13,6 @@ * Copyright (c) 2012-2014 Los Alamos National Security, LLC. * All rights reserved. * Copyright (c) 2013 Intel, Inc. All rights reserved - * * Copyright (c) 2014 Research Organization for Information Science * and Technology (RIST). All rights reserved. * $COPYRIGHT$ @@ -171,7 +170,9 @@ int rc; struct hostent *h; opal_buffer_t buf; - orte_process_name_t proc; + /* FIXME make sure the orte_process_name_t is 8 bytes aligned */ + opal_identifier_t _proc; + orte_process_name_t *proc = (orte_process_name_t *)&_proc; char *uri, *addr; char *proc_name; @@ -192,15 +193,15 @@ */ /* install the entry for the HNP */ - proc.jobid = ORTE_PROC_MY_NAME->jobid; - proc.vpid = 0; - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, - ORTE_DB_DAEMON_VPID, &proc.vpid, OPAL_UINT32))) { + proc->jobid = ORTE_PROC_MY_NAME->jobid; + proc->vpid = 0; + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, + ORTE_DB_DAEMON_VPID, &proc->vpid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; } addr = "HNP"; - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_HOSTNAME, addr, OPAL_STRING))) { ORTE_ERROR_LOG(rc); return rc; @@ -213,9 +214,9 @@ OBJ_CONSTRUCT(&buf, opal_buffer_t); for (i=0; i < num_nodes; i++) { /* define the vpid for this daemon */ - proc.vpid = i+1; + proc->vpid = i+1; /* store the hostname for the proc */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_HOSTNAME, nodes[i], OPAL_STRING))) { ORTE_ERROR_LOG(rc); return rc; @@ -223,7 +224,7 @@ /* the arch defaults to our arch so that non-hetero * case will yield correct behavior */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_ARCH, &opal_local_arch, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; @@ -244,7 +245,7 @@ */ /* construct the URI */ - orte_util_convert_process_name_to_string(&proc_name, &proc); + orte_util_convert_process_name_to_string(&proc_name, proc); asprintf(&uri, "%s;tcp://%s:%d", proc_name, addr, (int)orte_process_info.my_port); OPAL_OUTPUT_VERBOSE((2, orte_nidmap_output, "%s orte:util:build:daemon:nidmap node %s daemon %d addr %s uri %s", @@ -392,7 +393,9 @@ { int n; orte_vpid_t num_daemons; - orte_process_name_t daemon; + /* FIXME make sure the orte_process_name_t is 8 bytes aligned */ + opal_identifier_t _daemon; + orte_process_name_t *daemon = (orte_process_name_t *)&_daemon; opal_buffer_t buf; int rc=ORTE_SUCCESS; uint8_t oversub; @@ -432,10 +435,10 @@ } /* set the daemon jobid */ - daemon.jobid = ORTE_DAEMON_JOBID(ORTE_PROC_MY_NAME->jobid); + daemon->jobid = ORTE_DAEMON_JOBID(ORTE_PROC_MY_NAME->jobid); n=1; - while (OPAL_SUCCESS == (rc = opal_dss.unpack(&buf, &daemon.vpid, &n, ORTE_VPID))) { + while (OPAL_SUCCESS == (rc = opal_dss.unpack(&buf, &daemon->vpid, &n, ORTE_VPID))) { /* unpack and store the node's name */ n=1; if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, &nodename, &n, OPAL_STRING))) { @@ -443,7 +446,7 @@ return rc; } /* we only need the hostname for our own 
error messages, so mark it as internal */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&daemon, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)daemon, OPAL_SCOPE_INTERNAL, ORTE_DB_HOSTNAME, nodename, OPAL_STRING))) { ORTE_ERROR_LOG(rc); return rc; @@ -452,9 +455,9 @@ opal_output_verbose(2, orte_nidmap_output, "%s storing nodename %s for daemon %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - nodename, ORTE_VPID_PRINT(daemon.vpid)); + nodename, ORTE_VPID_PRINT(daemon->vpid)); if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_NAME_WILDCARD, OPAL_SCOPE_INTERNAL, - nodename, &daemon.vpid, OPAL_UINT32))) { + nodename, &daemon->vpid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; } @@ -462,10 +465,10 @@ OPAL_OUTPUT_VERBOSE((2, orte_nidmap_output, "%s orte:util:decode:nidmap daemon %s node %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - ORTE_VPID_PRINT(daemon.vpid), nodename)); + ORTE_VPID_PRINT(daemon->vpid), nodename)); /* if this is my daemon, then store the data for me too */ - if (daemon.vpid == ORTE_PROC_MY_DAEMON->vpid) { + if (daemon->vpid == ORTE_PROC_MY_DAEMON->vpid) { if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_NON_PEER, ORTE_DB_HOSTNAME, nodename, OPAL_STRING))) { ORTE_ERROR_LOG(rc); @@ -473,7 +476,7 @@ } /* we may need our daemon vpid to be shared with non-peers */ if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_NON_PEER, - ORTE_DB_DAEMON_VPID, &daemon.vpid, OPAL_UINT32))) { + ORTE_DB_DAEMON_VPID, &daemon->vpid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; } @@ -498,9 +501,9 @@ opal_output_verbose(2, orte_nidmap_output, "%s storing alias %s for daemon %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - alias, ORTE_VPID_PRINT(daemon.vpid)); + alias, ORTE_VPID_PRINT(daemon->vpid)); if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_NAME_WILDCARD, OPAL_SCOPE_INTERNAL, - alias, &daemon.vpid, OPAL_UINT32))) { + alias, &daemon->vpid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; } @@ -524,13 +527,13 @@ ORTE_ERROR_LOG(rc); return rc; } - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&daemon, OPAL_SCOPE_NON_PEER, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)daemon, OPAL_SCOPE_NON_PEER, ORTE_DB_HOSTID, &hostid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); return rc; } /* if this is my daemon, then store it as my hostid as well */ - if (daemon.vpid == ORTE_PROC_MY_DAEMON->vpid) { + if (daemon->vpid == ORTE_PROC_MY_DAEMON->vpid) { if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_NON_PEER, ORTE_DB_HOSTID, &hostid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); @@ -885,7 +888,10 @@ orte_proc_state_t state; orte_app_idx_t app_idx; int32_t restarts; - orte_process_name_t proc, dmn; + /* FIXME make sure the orte_process_name_t is 8 bytes aligned */ + opal_identifier_t _proc, _dmn; + orte_process_name_t *proc = (orte_process_name_t *)&_proc; + orte_process_name_t *dmn = (orte_process_name_t *)&_dmn; char *hostname; uint8_t flag; opal_buffer_t *bptr; @@ -899,16 +905,16 @@ } /* set the daemon jobid */ - dmn.jobid = ORTE_DAEMON_JOBID(ORTE_PROC_MY_NAME->jobid); + dmn->jobid = ORTE_DAEMON_JOBID(ORTE_PROC_MY_NAME->jobid); n = 1; /* cycle through the buffer */ orte_process_info.num_local_peers = 0; - while (ORTE_SUCCESS == (rc = opal_dss.unpack(&buf, &proc.jobid, &n, ORTE_JOBID))) { + while (ORTE_SUCCESS == (rc = opal_dss.unpack(&buf, &proc->jobid, &n, ORTE_JOBID))) { OPAL_OUTPUT_VERBOSE((2, 
orte_nidmap_output, "%s orte:util:decode:pidmap working job %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - ORTE_JOBID_PRINT(proc.jobid))); + ORTE_JOBID_PRINT(proc->jobid))); /* unpack and store the number of procs */ n=1; @@ -916,9 +922,9 @@ ORTE_ERROR_LOG(rc); goto cleanup; } - proc.vpid = ORTE_VPID_INVALID; + proc->vpid = ORTE_VPID_INVALID; /* only useful to ourselves */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_NPROCS, &num_procs, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; @@ -930,7 +936,7 @@ goto cleanup; } /* only of possible use to ourselves */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_NPROC_OFFSET, &offset, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; @@ -939,12 +945,12 @@ * all data for this job has been read */ n=1; - while (OPAL_SUCCESS == (rc = opal_dss.unpack(&buf, &proc.vpid, &n, ORTE_VPID))) { - if (ORTE_VPID_INVALID == proc.vpid) { + while (OPAL_SUCCESS == (rc = opal_dss.unpack(&buf, &proc->vpid, &n, ORTE_VPID))) { + if (ORTE_VPID_INVALID == proc->vpid) { break; } n=1; - if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, &dmn.vpid, &n, ORTE_VPID))) { + if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, &dmn->vpid, &n, ORTE_VPID))) { ORTE_ERROR_LOG(rc); goto cleanup; } @@ -965,15 +971,15 @@ goto cleanup; } #endif - if (proc.jobid == ORTE_PROC_MY_NAME->jobid && - proc.vpid == ORTE_PROC_MY_NAME->vpid) { + if (proc->jobid == ORTE_PROC_MY_NAME->jobid && + proc->vpid == ORTE_PROC_MY_NAME->vpid) { /* set mine */ orte_process_info.my_local_rank = local_rank; orte_process_info.my_node_rank = node_rank; /* if we are the local leader (i.e., local_rank=0), then record it */ if (0 == local_rank) { if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, - OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { + OPAL_DB_LOCALLDR, (opal_identifier_t*)proc, OPAL_ID_T))) { ORTE_ERROR_LOG(rc); goto cleanup; } @@ -983,14 +989,14 @@ orte_process_info.cpuset = strdup(cpu_bitmap); } #endif - } else if (proc.jobid == ORTE_PROC_MY_NAME->jobid && - dmn.vpid == ORTE_PROC_MY_DAEMON->vpid) { + } else if (proc->jobid == ORTE_PROC_MY_NAME->jobid && + dmn->vpid == ORTE_PROC_MY_DAEMON->vpid) { /* if we share a daemon, then add to my local peers */ orte_process_info.num_local_peers++; /* if this is the local leader (i.e., local_rank=0), then record it */ if (0 == local_rank) { if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)ORTE_PROC_MY_NAME, OPAL_SCOPE_INTERNAL, - OPAL_DB_LOCALLDR, (opal_identifier_t*)&proc, OPAL_ID_T))) { + OPAL_DB_LOCALLDR, (opal_identifier_t*)proc, OPAL_ID_T))) { ORTE_ERROR_LOG(rc); goto cleanup; } @@ -1020,18 +1026,18 @@ goto cleanup; } /* store the values in the database - again, these are for our own internal use */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_LOCALRANK, &local_rank, ORTE_LOCAL_RANK))) { ORTE_ERROR_LOG(rc); goto cleanup; } - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_NODERANK, &node_rank, ORTE_NODE_RANK))) { ORTE_ERROR_LOG(rc); goto cleanup; 
} #if OPAL_HAVE_HWLOC - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, OPAL_DB_CPUSET, cpu_bitmap, OPAL_STRING))) { ORTE_ERROR_LOG(rc); goto cleanup; @@ -1044,25 +1050,25 @@ * for ourself in the database * as we already did so during startup */ - if (proc.jobid != ORTE_PROC_MY_NAME->jobid || - proc.vpid != ORTE_PROC_MY_NAME->vpid) { + if (proc->jobid != ORTE_PROC_MY_NAME->jobid || + proc->vpid != ORTE_PROC_MY_NAME->vpid) { /* store the data for this proc - the location of a proc is something * we would potentially need to share with a non-peer */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_NON_PEER, - ORTE_DB_DAEMON_VPID, &dmn.vpid, OPAL_UINT32))) { + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_NON_PEER, + ORTE_DB_DAEMON_VPID, &dmn->vpid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; } /* in a singleton comm_spawn, we can be passed the name of a daemon, which * means that the proc's parent is invalid - check and avoid the rest of * this logic in that case */ - if (ORTE_VPID_INVALID != dmn.vpid) { + if (ORTE_VPID_INVALID != dmn->vpid) { /* if coprocessors were detected, lookup and store the hostid for this proc */ if (orte_coprocessors_detected) { /* lookup the hostid for this daemon */ vptr = &hostid; - if (ORTE_SUCCESS != (rc = opal_db.fetch((opal_identifier_t*)&dmn, ORTE_DB_HOSTID, + if (ORTE_SUCCESS != (rc = opal_db.fetch((opal_identifier_t*)dmn, ORTE_DB_HOSTID, (void**)&vptr, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; @@ -1070,29 +1076,29 @@ OPAL_OUTPUT_VERBOSE((2, orte_nidmap_output, "%s FOUND HOSTID %s FOR DAEMON %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - ORTE_VPID_PRINT(hostid), ORTE_VPID_PRINT(dmn.vpid))); + ORTE_VPID_PRINT(hostid), ORTE_VPID_PRINT(dmn->vpid))); /* store it as hostid for this proc */ - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_NON_PEER, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_NON_PEER, ORTE_DB_HOSTID, &hostid, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; } } /* lookup and store the hostname for this proc */ - if (ORTE_SUCCESS != (rc = opal_db.fetch_pointer((opal_identifier_t*)&dmn, ORTE_DB_HOSTNAME, + if (ORTE_SUCCESS != (rc = opal_db.fetch_pointer((opal_identifier_t*)dmn, ORTE_DB_HOSTNAME, (void**)&hostname, OPAL_STRING))) { ORTE_ERROR_LOG(rc); goto cleanup; } - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_NON_PEER, + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_NON_PEER, ORTE_DB_HOSTNAME, hostname, OPAL_STRING))) { ORTE_ERROR_LOG(rc); goto cleanup; } } /* store this procs global rank - only used by us */ - global_rank = proc.vpid + offset; - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_INTERNAL, + global_rank = proc->vpid + offset; + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_INTERNAL, ORTE_DB_GLOBAL_RANK, &global_rank, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup; @@ -1101,8 +1107,8 @@ /* update our own global rank - this is something we will need * to share with non-peers */ - global_rank = proc.vpid + offset; - if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)&proc, OPAL_SCOPE_NON_PEER, + global_rank = proc->vpid + offset; + if (ORTE_SUCCESS != (rc = opal_db.store((opal_identifier_t*)proc, OPAL_SCOPE_NON_PEER, 
ORTE_DB_GLOBAL_RANK, &global_rank, OPAL_UINT32))) { ORTE_ERROR_LOG(rc); goto cleanup;
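P.S. Assuming a C11 compiler (the wrapper flags quoted above already pass -std=c11), a static assertion is a cheap way to catch this class of problem at build time. A rough sketch, again with simplified stand-in types rather than the real headers:

/* sketch: compile-time alignment check, assuming a C11 compiler;
 * the types are simplified stand-ins for the real ORTE/OPAL declarations */
#include <stdint.h>
#include <stdalign.h>
#include <assert.h>

typedef struct { uint32_t jobid; uint32_t vpid; } orte_process_name_t;
typedef union  { uint64_t id; orte_process_name_t name; } opal_process_name_t;

/* passes: the union inherits the 8-byte alignment of its uint64_t member */
static_assert(alignof(opal_process_name_t) >= alignof(uint64_t),
              "opal_process_name_t is safe to reinterpret as a 64-bit id");

/* the same assertion on the bare orte_process_name_t would fail on the
 * usual 64-bit ABIs (alignment 4), which is exactly the bug reported here */

int main(void) { return 0; }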