Hi,

> Your commit r32459 fixed the bus error by correcting
> opal/dss/dss_copy.c. It's OK for trunk because mca_dstore_hash
> calls dss to copy data. But it's insufficient for v1.8 because
> mca_db_hash doesn't call dss and copies the data itself.
>
> The attached patch is the minimal patch to fix it in v1.8.
> My fix doesn't call dss but uses memcpy. I have confirmed it on
> SPARC64/Linux.
Thank you very much for your help. I applied your patch and it fixes
the bus error for my C programs as well. Unfortunately I get a SIGSEGV
for Java programs.

tyr java 126 mpiexec -np 1 java InitFinalizeMain
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff7ea3c7f0, pid=10506, tid=2
...

gdb shows the following backtrace.

tyr java 127 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
GNU gdb (GDB) 7.6.1
...
(gdb) run -np 1 java InitFinalizeMain
Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff7ea3c7f0, pid=10524, tid=2
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc compressed oops)
# Problematic frame:
# C  [libc.so.1+0x3c7f0]  strlen+0x50
#
# Failed to write core dump. Core dumps have been disabled. To enable
# core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid10524.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 10524 on node tyr exited
on signal 6 (Abort).
--------------------------------------------------------------------------
[LWP 2 exited]
[New Thread 2]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
(gdb) bt
#0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
#1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8  0xffffffff7ec7748c in vm_close () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#9  0xffffffff7ec74a6c in lt_dlclose () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001eae10)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
#11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001eae10)
    at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
#12 0xffffffff7ec9940c in mca_base_component_repository_release (component=0xffffffff7b023df0 <mca_oob_tcp_component>)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
#13 0xffffffff7ec9b754 in mca_base_component_unload (component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
#14 0xffffffff7ec9b7e8 in mca_base_component_close (component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
#15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1, components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
#16 0xffffffff7ec9b824 in mca_base_framework_components_close (framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
#17 0xffffffff7efae21c in orte_oob_base_close ()
    at ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
#18 0xffffffff7ecb28cc in mca_base_framework_close (framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
#19 0xffffffff7bf078c0 in rte_finalize ()
    at ../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
#20 0xffffffff7ef30a44 in orte_finalize ()
    at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
#21 0x00000001000070c4 in orterun (argc=5, argv=0xffffffff7fffe0d8)
    at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
#22 0x0000000100003d70 in main (argc=5, argv=0xffffffff7fffe0d8)
    at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
(gdb)

Kind regards, and once more thank you very much

Siegmar


> Sorry to respond so late.
>
> Regards,
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> > Siegmar, Ralph,
> >
> > I'm sorry to respond so late since last week.
> >
> > Ralph fixed the problem in r32459 and it was merged to v1.8
> > in r32474. But in v1.8 an additional custom patch is needed
> > because the db/dstore source code differs between trunk
> > and v1.8.
> >
> > I'm preparing and testing the custom patch right now.
> > Please wait a minute.
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> > > Hi,
> > >
> > > thank you very much to everybody who tried to solve my bus
> > > error problem on Solaris 10 Sparc.
> > > I thought that you found and fixed it, so I installed
> > > openmpi-1.8.2rc4r32485 on my machines (Solaris 10 Sparc (tyr),
> > > Solaris 10 x86_64 (sunpc1), openSUSE Linux 12.1 x86_64 (linpc1))
> > > with gcc-4.9.0. A small program works on my x86_64 architectures,
> > > but still breaks with a bus error on my Sparc system.
> > >
> > > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > > Hello!
> > > linpc1 fd1026 106 exit
> > > logout
> > > tyr small_prog 113 ssh sunpc1
> > > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > > Hello!
> > > sunpc1 fd1026 102 exit
> > > logout
> > > tyr small_prog 114 mpiexec -np 1 init_finalize
> > > [tyr:21109] *** Process received signal ***
> > > [tyr:21109] Signal: Bus Error (10)
> > > ...
> > >
> > > gdb shows the following backtrace.
> > >
> > > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > > GNU gdb (GDB) 7.6.1
> > > ...
> > > (gdb) run -np 1 init_finalize
> > > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 init_finalize
> > > [Thread debugging using libthread_db enabled]
> > > [New Thread 1 (LWP 1)]
> > > [New LWP 2]
> > > [tyr:21158] *** Process received signal ***
> > > [tyr:21158] Signal: Bus Error (10)
> > > [tyr:21158] Signal code: Invalid address alignment (1)
> > > [tyr:21158] Failing at address: ffffffff7fffd224
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > > /lib/sparcv9/libc.so.1:0xd8b98
> > > /lib/sparcv9/libc.so.1:0xcc70c
> > > /lib/sparcv9/libc.so.1:0xcc918
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> > > [ Signal 10 (BUS)]
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > > [tyr:21158] *** End of error message ***
> > > --------------------------------------------------------------------------
> > > mpiexec noticed that process rank 0 with PID 21158 on node tyr
> > > exited on signal 10 (Bus Error).
> > > --------------------------------------------------------------------------
> > > [LWP 2 exited]
> > > [New Thread 2]
> > > [Switching to Thread 1 (LWP 1)]
> > > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be
> > > found to satisfy query
> > > (gdb) bt
> > > #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> > > #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > > #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > > #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > > #4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > > #5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > > #6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > > #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > > #8  0xffffffff7ec7748c in vm_close () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > > #9  0xffffffff7ec74a6c in lt_dlclose () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > > #10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001ead30)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> > > #11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> > > #12 0xffffffff7ec9940c in mca_base_component_repository_release (component=0xffffffff7b023df0 <mca_oob_tcp_component>)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> > > #13 0xffffffff7ec9b754 in mca_base_component_unload (component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> > > #14 0xffffffff7ec9b7e8 in mca_base_component_close (component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
> > > #15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1, components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
> > > #16 0xffffffff7ec9b824 in mca_base_framework_components_close (framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
> > > #17 0xffffffff7efae21c in orte_oob_base_close ()
> > >     at ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
> > > #18 0xffffffff7ecb28cc in mca_base_framework_close (framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
> > > #19 0xffffffff7bf078c0 in rte_finalize ()
> > >     at ../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
> > > #20 0xffffffff7ef30a44 in orte_finalize ()
> > >     at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
> > > #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0d8)
> > >     at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
> > > #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0d8)
> > >     at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
> > > (gdb)
> > >
> > > Is this a new problem? I would be grateful if somebody could
> > > fix it. Thank you very much for any help in advance.
> > >
> > > Kind regards
> > >
> > > Siegmar