Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Siegmar --

So it looks like the net problem is fixed; good. I'll commit and CMR that.

For the DDT test, can you give us access to this machine? It might help speed debugging a lot. (I'll let Nathan reply about the var problem)

If not, can you provide the following information about the DDT test:

1. It SIGBUS's at a point; can you send the full backtrace?
2. It complains about a misaligned read of a variable and shows its address. Can you print the values of all the parameters of the function so that we can see *which* one it is using for the misaligned read? (the printf is using 4 different variables, and we don't know which one is causing the misaligned read)

On Dec 19, 2013, at 8:52 AM, Siegmar Gross wrote:

> Hi,
>
> at first thank you very much for your help.
>
> 1st patch:
>
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
>
> 2nd patch:
>
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>>
>> Please try the attached patch.
>
>
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't use the patches. Unfortunately I still
> get a Bus Error. Hopefully I didn't make a mistake applying your
> patches. Therefore I show you a "diff" for my files. By the way,
> I tried to apply your patches with "patch -b -i ".
> Is it necessary to use a different command?
>
>
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> < if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
> < ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
> < } else {
> < ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> < }
> ---
> > ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163
>
>
>
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> < struct sockaddr_in inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
> > const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> > const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> < if((inaddr1.sin_addr.s_addr & netmask) ==
> < (inaddr2.sin_addr.s_addr & netmask)) {
> ---
> > if((inaddr1->sin_addr.s_addr & netmask) ==
> > (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> < struct sockaddr_in6 inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
> > const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> > const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> > struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> > struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167
>
>
>
> Now my debug information.
>
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your
> .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a
> (process id 10998)
> Reading libc_psr.so.1
> ...
> MCA compress: parameter "compress_base_verbose" (current value:
> "-1", data source: default, level: 8 dev/detail,
> type: int)
>
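One way to answer the second question above (which parameter is behind the misaligned read) is to print each pointer together with whether it meets the alignment its type requires. The following is only a minimal sketch in C: `is_aligned` and `report_alignment` are hypothetical helper names, not Open MPI functions, and the alignment passed for each parameter is an assumption the caller has to supply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of Open MPI): returns 1 if ptr is a
 * multiple of `alignment` bytes, 0 otherwise. Nothing is dereferenced. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t) ptr % alignment) == 0;
}

/* Hypothetical helper: print one parameter's address and whether it is
 * suitably aligned, so a misaligned argument stands out in the log. */
static void report_alignment(const char *name, const void *ptr,
                             size_t alignment)
{
    fprintf(stderr, "%s = %p (%s for %lu-byte alignment)\n",
            name, (void *) ptr,
            is_aligned(ptr, alignment) ? "ok" : "MISALIGNED",
            (unsigned long) alignment);
}
```

Calling `report_alignment("value", value, sizeof(int))` for each pointer argument just before the failing statement would show which address is not a multiple of the size being read.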
Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Hi,

at first thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
>
> Please try the attached patch.

I applied both patches manually to openmpi-1.9a1r29972, because my patch program couldn't use the patches. Unfortunately I still get a Bus Error. Hopefully I didn't make a mistake applying your patches. Therefore I show you a "diff" for my files. By the way, I tried to apply your patches with "patch -b -i ". Is it necessary to use a different command?

tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
< } else {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163

tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
< (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
> (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167

Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value:
  "-1", data source: default, level: 8 dev/detail,
  type: int)
  Verbosity level for the compress framework (0 = no
  verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string at line 1680 in file "mca_base_var.c"
 1680 ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0x7fffd5f8
    which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in
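The net.c change in the diff above swaps pointer casts for local copies filled with memcpy, so the address fields are read from properly aligned stack variables regardless of how the caller's buffer is aligned. The sketch below illustrates that idea for the IPv4 case only. It is not the Open MPI source: `same_ipv4_network` is a hypothetical name, and taking `const void *` arguments (rather than `struct sockaddr *`) is an assumption made here so that a deliberately misaligned buffer can be passed without a cast.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: returns 1 if both IPv4 socket addresses fall in
 * the same network under `netmask` (same byte order as s_addr). */
static int same_ipv4_network(const void *addr1, const void *addr2,
                             uint32_t netmask)
{
    struct sockaddr_in in1, in2;

    assert(addr1 != NULL && addr2 != NULL);

    /* Copy byte-wise into aligned locals instead of casting the input
     * pointers; memcpy has no alignment requirement on its source. */
    memcpy(&in1, addr1, sizeof(in1));
    memcpy(&in2, addr2, sizeof(in2));

    return (in1.sin_addr.s_addr & netmask) ==
           (in2.sin_addr.s_addr & netmask);
}
```

The copy costs a few bytes of stack per call, which is usually negligible next to a bus error on strict-alignment platforms such as SPARC.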
Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Found the problem. Was accessing a boolean variable using intval. That is a bug that has gone unnoticed on all platforms but thankfully Solaris caught it.

Please try the attached patch.

-Nathan

On Wed, Dec 18, 2013 at 12:27:29PM +0100, Siegmar Gross wrote:
> Hi,
>
> today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
> not solved yet. Has somebody time to look into that matter or is
> Solaris support abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 7b55eb8..c043c06 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -1682,7 +1682,11 @@ static int var_value_string (mca_base_var_t *var, char **value_string)
         ret = (0 > ret) ? OPAL_ERR_OUT_OF_RESOURCE : OPAL_SUCCESS;
     } else {
-        ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
+        } else {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        }
         *value_string = strdup (tmp);
         if (NULL == value_string) {
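The fix above picks the member that matches the variable's declared type before reading it. A plausible reading, which the thread itself does not spell out, is that the storage pointer refers to the caller's own object, so a `bool` read through an `int` member becomes a 4-byte load from an address that may only be 1-byte aligned. The sketch below shows the type-dispatch pattern in isolation. All names (`var_type_t`, `read_var_as_int`) are hypothetical and not the Open MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type tag standing in for the variable's declared type. */
typedef enum { VAR_TYPE_INT, VAR_TYPE_BOOL } var_type_t;

/* Hypothetical helper: read the value behind `storage` using the width
 * that matches its declared type, and return it as an int. */
static int read_var_as_int(var_type_t type, const void *storage)
{
    assert(storage != NULL);

    if (VAR_TYPE_BOOL == type) {
        /* A bool is typically 1 byte: read it as a bool, never as int. */
        return *(const bool *) storage ? 1 : 0;
    }

    /* Only genuine int storage is read with an int-sized load. */
    return *(const int *) storage;
}
```

On x86 a mismatched read of this kind tends to go unnoticed because unaligned loads are tolerated, which would be consistent with the bug surfacing only on Solaris/SPARC.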
Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Siegmar --

Thanks for keeping us honest! I just filed three tickets with the issues you reported:

https://svn.open-mpi.org/trac/ompi/ticket/3988
https://svn.open-mpi.org/trac/ompi/ticket/3989
https://svn.open-mpi.org/trac/ompi/ticket/3990

On Dec 18, 2013, at 6:27 AM, Siegmar Gross wrote:

> Hi,
>
> today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
> not solved yet. Has somebody time to look into that matter or is
> Solaris support abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Hi,

today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun C 5.12. Unfortunately my problems with bus errors, which I reported December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are not solved yet. Has somebody time to look into that matter or is Solaris support abandoned, so that I have to stay with openmpi-1.6.x in the future? Thank you very much for any help in advance.

Kind regards

Siegmar