Hi,

first of all, thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
>
> Please try the attached patch.

I applied both patches manually to openmpi-1.9a1r29972, because my
patch program couldn't apply them. Unfortunately I still get a Bus
Error. Hopefully I didn't make a mistake applying your patches, so
here is a "diff" of my files. By the way, I tried to apply your
patches with "patch -b -i <your file>". Is it necessary to use a
different command?


tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
< } else {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163


tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
<    run into bus errors on Solaris/SPARC */
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
<    (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
>    (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
<    run into bus errors on Solaris/SPARC */
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167


Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value: "-1", data source: default, level: 8 dev/detail, type: int)
              Verbosity level for the compress framework (0 = no verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string at line 1680 in file "mca_base_var.c"
 1680   ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd5f8
    which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
  802   return (OPAL_SUCCESS != ret) ? ret : index;
(dbx)


In my opinion it is the same error as before. I still get a Bus Error
with "make check".

tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ddt_raw
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run
Running: ddt_raw
(process id 11018)
Reading libc_psr.so.1

 #
 * TEST INVERSED VECTOR
 #

t@1 (l@1) signal BUS (invalid address alignment) in opal_convertor_raw at line 71 in file "opal_convertor_raw.c"
   71   DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p, %u}, %lu )\n", (void*)pConvertor,
(dbx)


Once more I think it is the same error. I have the same problem with
my small program.

tyr small_prog 62 mpicc init_finalize.c
tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.9_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx)
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out
(process id 11050)
Reading libc_psr.so.1
Reading mca_shmem_mmap.so
Reading libmp.so.2
Reading libscf.so.1
Reading libuutil.so.1
Reading libgen.so.1
Reading mca_shmem_posix.so
Reading mca_shmem_sysv.so
Reading mca_ess_env.so
Reading mca_ess_hnp.so
Reading mca_ess_singleton.so
Reading mca_ess_tool.so
Reading mca_pstat_test.so
Reading mca_state_app.so
Reading mca_state_hnp.so
Reading mca_state_novm.so
Reading mca_state_orted.so
Reading mca_state_staged_hnp.so
Reading mca_state_staged_orted.so
Reading mca_state_tool.so
Reading mca_errmgr_default_app.so
Reading mca_errmgr_default_hnp.so
Reading mca_errmgr_default_orted.so
Reading mca_errmgr_default_tool.so
Reading mca_plm_rsh.so
Reading mca_oob_tcp.so
Reading mca_rml_oob.so
Reading mca_routed_binomial.so
Reading mca_routed_debruijn.so
Reading mca_routed_direct.so
Reading mca_routed_radix.so
Reading mca_db_hash.so
Reading mca_db_print.so
Reading mca_grpcomm_bad.so
Reading mca_ras_simulator.so
Reading mca_rmaps_lama.so
Reading mca_rmaps_mindist.so
Reading mca_rmaps_ppr.so
Reading mca_rmaps_rank_file.so
Reading mca_rmaps_resilient.so
Reading mca_rmaps_round_robin.so
Reading mca_rmaps_seq.so
Reading mca_rmaps_staged.so
Reading mca_odls_default.so
Reading mca_iof_hnp.so
Reading mca_iof_mr_hnp.so
Reading mca_iof_mr_orted.so
Reading mca_iof_orted.so
Reading mca_iof_tool.so
Reading mca_filem_raw.so
Reading mca_dfs_app.so
Reading mca_dfs_orted.so
Reading mca_dfs_test.so


Now the program hangs.


^Cdbx: warning: Interrupt ignored but forwarded to child.
t@1 (l@1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
0xffffffff7d5dc740: __pollsys+0x0004:   ta       %icc,0x0000000000000040
Current function is orterun
 1049   opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
(dbx)
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out
(process id 11054)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd438
    which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
  802   return (OPAL_SUCCESS != ret) ? ret : index;
(dbx)


I'm sorry that you have so much trouble with me and Solaris. On the
other hand, I still hope that you can solve the problem(s). Once more,
thank you very much in advance for your help.


Kind regards

Siegmar