Hi,
first of all, thank you very much for your help.
1st patch:
> Can you apply the following patch to a trunk tarball and see if it works
> for you?
2nd patch:
> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
>
> Please try the attached patch.
I applied both patches manually to openmpi-1.9a1r29972, because
my patch program couldn't apply them. Unfortunately I still
get a Bus Error. I hope I didn't make a mistake applying your
patches, so here is a "diff" of my files. By the way,
I tried to apply your patches with "patch -b -i <your file>".
Do I need to use a different command?
tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
< } else {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163
tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< run into bus errors on Solaris/SPARC */
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
< (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
> (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< run into bus errors on Solaris/SPARC */
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167
Here is my debug information.
tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value:
"-1", data source: default, level: 8 dev/detail,
type: int)
Verbosity level for the compress framework (0 = no
verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
at line 1680 in file "mca_base_var.c"
1680 ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd5f8
which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
802 return (OPAL_SUCCESS != ret) ? ret : index;
(dbx)
In my opinion it is the same error as before.
I still get a Bus Error with "make check".
tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ddt_raw
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run
Running: ddt_raw
(process id 11018)
Reading libc_psr.so.1
#
* TEST INVERSED VECTOR
#
t@1 (l@1) signal BUS (invalid address alignment) in opal_convertor_raw
at line 71 in file "opal_convertor_raw.c"
71 DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p,
%u}, %lu )\n", (void*)pConvertor,
(dbx)
Again, I think it is the same error. I have the same problem with
my small program.
tyr small_prog 62 mpicc init_finalize.c
tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
/usr/local/openmpi-1.9_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx)
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out
(process id 11050)
Reading libc_psr.so.1
Reading mca_shmem_mmap.so
Reading libmp.so.2
Reading libscf.so.1
Reading libuutil.so.1
Reading libgen.so.1
Reading mca_shmem_posix.so
Reading mca_shmem_sysv.so
Reading mca_ess_env.so
Reading mca_ess_hnp.so
Reading mca_ess_singleton.so
Reading mca_ess_tool.so
Reading mca_pstat_test.so
Reading mca_state_app.so
Reading mca_state_hnp.so
Reading mca_state_novm.so
Reading mca_state_orted.so
Reading mca_state_staged_hnp.so
Reading mca_state_staged_orted.so
Reading mca_state_tool.so
Reading mca_errmgr_default_app.so
Reading mca_errmgr_default_hnp.so
Reading mca_errmgr_default_orted.so
Reading mca_errmgr_default_tool.so
Reading mca_plm_rsh.so
Reading mca_oob_tcp.so
Reading mca_rml_oob.so
Reading mca_routed_binomial.so
Reading mca_routed_debruijn.so
Reading mca_routed_direct.so
Reading mca_routed_radix.so
Reading mca_db_hash.so
Reading mca_db_print.so
Reading mca_grpcomm_bad.so
Reading mca_ras_simulator.so
Reading mca_rmaps_lama.so
Reading mca_rmaps_mindist.so
Reading mca_rmaps_ppr.so
Reading mca_rmaps_rank_file.so
Reading mca_rmaps_resilient.so
Reading mca_rmaps_round_robin.so
Reading mca_rmaps_seq.so
Reading mca_rmaps_staged.so
Reading mca_odls_default.so
Reading mca_iof_hnp.so
Reading mca_iof_mr_hnp.so
Reading mca_iof_mr_orted.so
Reading mca_iof_orted.so
Reading mca_iof_tool.so
Reading mca_filem_raw.so
Reading mca_dfs_app.so
Reading mca_dfs_orted.so
Reading mca_dfs_test.so
Now the program hangs.
^Cdbx: warning: Interrupt ignored but forwarded to child.
t@1 (l@1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
0xffffffff7d5dc740: __pollsys+0x0004: ta %icc,0x0000000000000040
Current function is orterun
1049 opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
(dbx)
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out
(process id 11054)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd438
which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
802 return (OPAL_SUCCESS != ret) ? ret : index;
(dbx)
I'm sorry that Solaris and I are causing you so much trouble. I still
hope that you can solve the problem(s). Thank you again in advance
for your help.
Kind regards
Siegmar