Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Jeff Squyres (jsquyres)
Siegmar --

So it looks like the net problem is fixed; good.  I'll commit and CMR that.

For the DDT test, can you give us access to this machine?  It might help speed 
debugging a lot.  (I'll let Nathan reply about the var problem)

If not, can you provide the following information about the DDT test:

1. It SIGBUS's at a point; can you send the full backtrace?
2. It complains about a misaligned read of a variable and shows its address.  
Can you print the values of all the parameters of the function so that we can 
see *which* one it is using for the misaligned read?  (the printf is using 4 
different variables, and we don't know which one is causing the misaligned read)


On Dec 19, 2013, at 8:52 AM, Siegmar Gross wrote:

> Hi,
> 
> First of all, thank you very much for your help.
> 
> 1st patch:
> 
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
> 
> 2nd patch:
> 
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>> 
>> Please try the attached patch.
> 
> 
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't apply them. Unfortunately I still
> get a Bus Error. I hope I didn't make a mistake applying your
> patches, so here is a "diff" of my files. By the way,
> I tried to apply your patches with "patch -b -i ".
> Is it necessary to use a different command?
> 
> 
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> < if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
> < } else {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> < }
> ---
> > ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163 
> 
> 
> 
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> < struct sockaddr_in inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
> > const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> > const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> < if((inaddr1.sin_addr.s_addr & netmask) ==
> <    (inaddr2.sin_addr.s_addr & netmask)) {
> ---
> > if((inaddr1->sin_addr.s_addr & netmask) ==
> >    (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> < struct sockaddr_in6 inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
> > const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> > const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> > struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> > struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167 
> 
> 
> 
> Now my debug information.
> 
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a 
> (process id 10998)
> Reading libc_psr.so.1
> ...
> MCA compress: parameter "compress_base_verbose" (current value:
>   "-1", data source: default, level: 8 dev/detail,
>   type: int)
> 

Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Siegmar Gross
Hi,

First of all, thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
> 
> Please try the attached patch.


I applied both patches manually to openmpi-1.9a1r29972, because
my patch program couldn't apply them. Unfortunately I still
get a Bus Error. I hope I didn't make a mistake applying your
patches, so here is a "diff" of my files. By the way,
I tried to apply your patches with "patch -b -i ".
Is it necessary to use a different command?


tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
<     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
< } else {
<     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163 



tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
<    (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
>    (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167 



Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a 
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value:
  "-1", data source: default, level: 8 dev/detail,
  type: int)
  Verbosity level for the compress framework (0 = no
  verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
  at line 1680 in file "mca_base_var.c"
 1680  ret = asprintf (value_string, var_type_formats[var->mbv_type],
  value[0]);
(dbx) 
(dbx) 
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a 
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0x7fffd5f8
which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in 

Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Nathan Hjelm
Found the problem. Was accessing a boolean variable using intval. That
is a bug that has gone unnoticed on all platforms but thankfully Solaris
caught it.

Please try the attached patch.

-Nathan

On Wed, Dec 18, 2013 at 12:27:29PM +0100, Siegmar Gross wrote:
> Hi,
> 
> Today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> on December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, have
> not been solved yet. Does anybody have time to look into this, or has
> Solaris support been abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much in advance for any help.
> 
> 
> Kind regards
> 
> Siegmar
> 
diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 7b55eb8..c043c06 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -1682,7 +1682,11 @@ static int var_value_string (mca_base_var_t *var, char **value_string)
 
         ret = (0 > ret) ? OPAL_ERR_OUT_OF_RESOURCE : OPAL_SUCCESS;
     } else {
-        ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
+        } else {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        }
 
         *value_string = strdup (tmp);
         if (NULL == value_string) {




Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Jeff Squyres (jsquyres)
Siegmar --

Thanks for keeping us honest!  I just filed three tickets with the issues you 
reported:

https://svn.open-mpi.org/trac/ompi/ticket/3988
https://svn.open-mpi.org/trac/ompi/ticket/3989
https://svn.open-mpi.org/trac/ompi/ticket/3990


On Dec 18, 2013, at 6:27 AM, Siegmar Gross wrote:

> Hi,
> 
> Today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> on December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, have
> not been solved yet. Does anybody have time to look into this, or has
> Solaris support been abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much in advance for any help.
> 
> 
> Kind regards
> 
> Siegmar
> 


-- 
Jeff Squyres
jsquy...@cisco.com



[OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Siegmar Gross
Hi,

Today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
C 5.12. Unfortunately my problems with bus errors, which I reported
on December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, have
not been solved yet. Does anybody have time to look into this, or has
Solaris support been abandoned, so that I have to stay with openmpi-1.6.x
in the future? Thank you very much in advance for any help.


Kind regards

Siegmar