Re: [OMPI devel] [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-12 Thread Siegmar Gross
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master


this does not work:

mpirun -np 1 --host motomachi ./spawn_master # not enough slots available, aborts with a user friendly error message
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master # various errors, sm_segment_attach() fails, a task crashes
and this ends up with the following error message

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[15519,2],0]) is on host: motomachi
  Process 2 ([[15519,2],1]) is on host: unknown!
  BTLs attempted: self tcp

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master # same error as above
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master # same error as above


for the record, the following command surprisingly works

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master
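
spawn_master itself is not included in this thread; as a rough idea of what such a test exercises, a minimal MPI_Comm_spawn program might look like the sketch below (hypothetical: the child binary name, the number of spawned tasks, and the real test's details are assumptions, not Siegmar's actual code).

/* Hypothetical sketch of a spawn test, not the actual spawn_master. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Spawn 4 child processes running a (hypothetical) ./spawn_slave binary.
     * With --slot-list/--host, the children must fit into the slots mpirun
     * knows about, which is exactly what the commands above exercise. */
    MPI_Comm_spawn("./spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}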



bottom line, my guess is that when the user specifies both the --slot-list and the --host options
*and* no default number of slots is given for the hosts, we should default to using the number
of slots from the slot list
(e.g. in this case, default to --host motomachi:12 instead of, I guess, --host motomachi:1)
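
A minimal sketch of that defaulting rule -- counting the slots described by a --slot-list string such as "0:0-5,1:0-5" -- could look like the code below (illustration only; the function and variable names are invented and this is not the actual orte/ompi code).

/* Illustration only: count the cores named in a slot-list string like
 * "0:0-5,1:0-5" so it could serve as a default slot count for --host. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int count_slot_list(const char *slot_list)
{
    char *copy = strdup(slot_list), *saveptr = NULL;
    int total = 0;

    /* each comma-separated entry is "socket:first-last" or "socket:core" */
    for (char *tok = strtok_r(copy, ",", &saveptr); tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr)) {
        const char *cores = strchr(tok, ':');
        cores = cores ? cores + 1 : tok;
        int first, last;
        if (sscanf(cores, "%d-%d", &first, &last) == 2) {
            total += last - first + 1;
        } else {
            total += 1;           /* single core entry */
        }
    }
    free(copy);
    return total;
}

int main(void)
{
    /* prints 12, i.e. the proposal would behave like --host motomachi:12 */
    printf("%d\n", count_slot_list("0:0-5,1:0-5"));
    return 0;
}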


/* fwiw, I filed https://github.com/open-mpi/ompi/pull/2715, but this is not the root cause */


Cheers,


Gilles



 Forwarded Message 
Subject:    Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Date:   Wed, 11 Jan 2017 20:39:02 +0900
From:   Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Reply-To:   Open MPI Users <us...@lists.open-mpi.org>
To: Open MPI Users <us...@lists.open-mpi.org>



Siegmar,

Your slot list is correct.
An invalid slot list for your node would be 0:1-7,1:0-7

/* and since the test requires only 5 tasks, that could even work with such an invalid list.
My vm is single socket with 4 cores, so a 0:0-4 slot list results in an unfriendly pmix error */

Bottom line, your test is correct, and there is a bug in v2.0.x that I will start investigating tomorrow.

Cheers,

Gilles

On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

Hi Gilles,

thank you very much for your help. What does an incorrect slot list
mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
allowed to specify more slots than available, to specify fewer
slots than available, or to specify more slots than needed for
the processes?


Kind regards

Siegmar

On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:

Siegmar,

I was able to reproduce the issue on my vm
(No need for a real heterogeneous cluster here)

I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a very unfriendly error message.
Right now, the 4th spawned task crashes, so this is a different issue.

Cheers,

Gilles

r...@open-mpi.org wrote:
I think there is some relevant discussion here:
https://github.com/open-mpi/ompi/issues/1569

It looks like Gilles had (at least at one point) a fix for master when
--enable-heterogeneous is used, but I don’t know if that was committed.

On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

HI Siegmar,

You have some config parameters I wasn't trying that may have some impact.
I'll give it a try with these parameters.

This should be enough info for now,

Thanks,

Howard


2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

Hi Howard,

I use the following commands to build and insta

Re: [OMPI devel] v2.0.1 PRs: open season

2016-07-16 Thread Siegmar Gross

Hi Jeff,

I didn't find the PR for my problem in your list and I'm waiting
for a solution.

https://github.com/open-mpi/ompi/issues/1573


Kind regards and thank you very much for any help in advance

Siegmar


On 15.07.2016 at 16:15, Jeff Squyres (jsquyres) wrote:

v2.0.1 is officially open to accept PRs.

Please note that many v2.0.1 PRs still need reviews:

- 36 open v2.0.1 PRs
- only 13 have reviews

Please start getting reviews for your v2.0.1 PRs -- no review, no merge:


https://github.com/open-mpi/ompi-release/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aopen%20milestone%3Av2.0.1

Also, some of the PRs are a little old -- I just kicked off CI on PRs that 
hadn't had a CI run in the past week (although the Mellanox Jenkins looks like 
it might be failing tests due to a local issue -- hopefully we can get that 
fixed up shortly).



Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Siegmar Gross
Hi,

at first thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
> 
> Please try the attached patch.


I applied both patches manually to openmpi-1.9a1r29972, because
my patch program couldn't use the patches. Unfortunately I still
get a Bus Error. Hopefully I didn't make a mistake applying your
patches. Therefore I show you a "diff" for my files. By the way,
I tried to apply your patches with "patch -b -i ".
Is it necessary to use a different command?


tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< mbv_type) {
< mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, );
< } else {
< mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, );
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, );
tyr openmpi-1.9a1r29972 163 



tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
<(inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
>(inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167 
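
The net.c hunks above replace pointer casts with local copies. The sketch below shows the general idea: a struct sockaddr buffer is not guaranteed to be aligned for struct sockaddr_in, so on a strict-alignment CPU like SPARC a direct cast-and-dereference can raise SIGBUS, while memcpy into a properly aligned local is always safe (illustration only, not the actual opal/util/net.c code).

/* Illustration of the alignment issue the net.c patch works around. */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Risky on strict-alignment CPUs (e.g. SPARC): addr may only have the
 * alignment of struct sockaddr, so the 4-byte load can be misaligned. */
static in_addr_t addr_cast(const struct sockaddr *addr)
{
    const struct sockaddr_in *sin = (const struct sockaddr_in *) addr;
    return sin->sin_addr.s_addr;
}

/* Safe everywhere: copy into a properly aligned local first. */
static in_addr_t addr_copy(const struct sockaddr *addr)
{
    struct sockaddr_in sin;
    memcpy(&sin, addr, sizeof(sin));
    return sin.sin_addr.s_addr;
}

int main(void)
{
    struct sockaddr_in example;
    memset(&example, 0, sizeof(example));
    example.sin_family = AF_INET;
    example.sin_addr.s_addr = htonl(0x7f000001);   /* 127.0.0.1 */

    /* here the buffer happens to be aligned, so both calls succeed */
    printf("%x %x\n",
           (unsigned) addr_cast((struct sockaddr *) &example),
           (unsigned) addr_copy((struct sockaddr *) &example));
    return 0;
}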



Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a 
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value:
  "-1", data source: default, level: 8 dev/detail,
  type: int)
  Verbosity level for the compress framework (0 = no
  verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
  at line 1680 in file "mca_base_var.c"
 1680  ret = asprintf (value_string, var_type_formats[var->mbv_type],
  value[0]);
(dbx) 
(dbx) 
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a 
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0x7fffd5f8
which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in 

[OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Siegmar Gross
Hi,

today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
C 5.12. Unfortunately my problems with bus errors, which I reported
December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
not solved yet. Has somebody time to look into that matter or is
Solaris support abandoned, so that I have to stay with openmpi-1.6.x
in the future? Thank you very much for any help in advance.


Kind regards

Siegmar



Re: [OMPI devel] [OMPI users] Error in openmpi-1.9a1r29179

2013-09-18 Thread Siegmar Gross
Hello Josh,

thank you very much for your help. Unfortunately I still have a
problem building Open MPI.

> I pushed a bunch of fixes, can you please try now.

I tried to build openmpi-1.9a1r29197 on my platforms, and now I get
the following error on all platforms.


linpc1 openmpi-1.9a1r29197-Linux.x86_64.64_cc 117 tail -22 log.make.Linux.x86_64.64_cc
  CC   base/memheap_base_alloc.lo
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 136: warning: parameter in inline asm statement unused: %3
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 182: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 203: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 224: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 245: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 167: warning: statement not reached
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 192: warning: statement not reached
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 217: warning: statement not reached
"../../../../openmpi-1.9a1r29197/oshmem/mca/spml/spml.h", line 76: warning: anonymous union declaration
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 112: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 119: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 124: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 248: warning: pointer to void or function used in arithmetic
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 286: syntax error before or at: |
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 300: warning: pointer to void or function used in arithmetic
cc: acomp failed for ../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c
make[2]: *** [base/memheap_base_alloc.lo] Error 1
make[2]: Leaving directory `/export2/src/openmpi-1.9/openmpi-1.9a1r29197-Linux.x86_64.64_cc/oshmem/mca/memheap'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/export2/src/openmpi-1.9/openmpi-1.9a1r29197-Linux.x86_64.64_cc/oshmem'
make: *** [all-recursive] Error 1
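
Two of the diagnostics above come from constructs that gcc accepts but Sun C 5.12 flags in its default mode: anonymous unions inside a struct and arithmetic on void pointers. A minimal reproduction of just those two constructs might look like this (illustration only, not the oshmem code; the "syntax error before or at: |" is a separate issue not reproduced here).

/* Minimal reproduction of two of the Sun C diagnostics above. */
#include <stddef.h>

struct message {
    int type;
    union {                 /* "anonymous union declaration": a C11 construct  */
        int   ivalue;       /* (and long-standing gcc extension) that Sun C    */
        float fvalue;       /* 5.12 warns about in its default language mode   */
    };
};

static void *advance(void *base, size_t bytes)
{
    /* "pointer to void or function used in arithmetic": void * arithmetic
     * is a gcc extension; ISO C requires a cast to char * first. */
    return (char *) base + bytes;   /* portable form */
    /* return base + bytes; */      /* the non-portable form Sun C flags */
}

int main(void)
{
    struct message m;
    m.ivalue = 42;                  /* member access through the anonymous union */
    (void) advance(&m, sizeof m.type);
    return 0;
}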


Kind regards

Siegmar




> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com] 
> Sent: Tuesday, September 17, 2013 6:37 AM
> To: Siegmar Gross; Open MPI Developers List
> Cc: Joshua Ladd
> Subject: Re: [OMPI users] Error in openmpi-1.9a1r29179
> 
> ...moving over to the devel list...
> 
> Dave and I looked at this during a break in the EuroMPI conference, and 
> noticed several things:
> 
> 1. Some of the shmem interfaces are functions (i.e., return non-void) and 
> some are subroutines (i.e., return void).  They're currently all using a 
> single macro to declare the interfaces, which assumes functions.  So this 
> macro is incorrect for subroutines -- you really need 2 macros.
> 
> 2. The macro name is OMPI_GENERATE_FORTRAN_BINDINGS -- why isn't is 
> SHMEM_GENERATE_FORTRAN_BINDINGS?
> 
> 3. I notice that none of the Fortran interfaces are prototyped in shmem.fh.  
> Why not? A shmem person here in Madrid mentioned that there is supposed to be 
> a shmem.fh file and a shmem modulefile.
> 
> 
> On Sep 17, 2013, at 8:49 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> > Hi,
> > 
> > I tried to install openmpi-1.9a1r29179 on "openSuSE Linux 12.1", 
> > "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in 
> > 64-bit mode. Unfortunately "make" breaks with the same error on all 
> > platforms.
> > 
> > tail -15 log.make.Linux.x86_64.64_cc
> > 
> >  CCLD libshmem_c.la
> > make[3]: Leaving directory 
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/c'
> > make[2]: Leaving directory 
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/c'
> > Making all in shmem/fortran
> > make[2]: Entering directory 
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/fortran'
> >  CC   start
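
Regarding point 1 in Jeff's quoted message above (one binding-generation macro that assumes everything returns a value): a stripped-down sketch of the problem and of the two-macro fix might look like this. It is an illustration only; the real OMPI_GENERATE_FORTRAN_BINDINGS macro takes more arguments and generates the various Fortran name-mangling variants.

/* Illustration of why one macro cannot cover both shmem functions and
 * subroutines.  Not the real OMPI/oshmem macro. */
#include <stdio.h>

/* A macro written for functions: it declares a return type and returns a
 * value.  Instantiating it for a void subroutine would produce
 * "return c_call(...);" inside a void function, which is invalid C. */
#define GENERATE_FUNCTION_BINDING(ret_type, f_name, c_call)  \
    ret_type f_name(int *len) { return c_call(*len); }

/* Subroutines (void return) therefore need their own variant. */
#define GENERATE_SUBROUTINE_BINDING(f_name, c_call)          \
    void f_name(int *len) { c_call(*len); }

static int  c_shmem_query(int len)  { return len * 2; }
static void c_shmem_action(int len) { printf("acting on %d\n", len); }

GENERATE_FUNCTION_BINDING(int, shmem_query_f, c_shmem_query)
GENERATE_SUBROUTINE_BINDING(shmem_action_f, c_shmem_action)

int main(void)
{
    int len = 3;
    printf("%d\n", shmem_query_f(&len));   /* prints 6 */
    shmem_action_f(&len);
    return 0;
}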

Re: [OMPI devel] v1.7.0rc7

2013-02-26 Thread Siegmar Gross
Hi

> This release candidate is the last one we expect to have
> before release, so please test it. Can be downloaded from
> the usual place:
> 
> http://www.open-mpi.org/software/ompi/v1.7/
> 
> Latest changes include:
> 
> * update of the alps/lustre configure code
> * fixed solaris hwloc code
> * various mxm updates
> * removed java bindings (delayed until later release)
> * improved the --report-bindings output
> * a variety of minor cleanups


My rankfiles don't work.

tyr rankfiles 106 ompi_info | grep "MPI:"
Open MPI: 1.7rc7
tyr rankfiles 107 mpiexec -report-bindings -rf rf_ex_linpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 108 mpiexec -report-bindings -rf rf_ex_sunpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 109 mpiexec -report-bindings -rf rf_ex_sunpc_linpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 110 



They work as expected for openmpi-1.6.4.

tyr rankfiles 99 ompi_info | grep "MPI:"
Open MPI: 1.6.4rc4r28039
tyr rankfiles 100 mpiexec -report-bindings -rf rf_ex_linpc hostname
[linpc0:17655] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
[linpc1:06707] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
linpc1
[linpc1:06707] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[linpc1:06707] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
linpc1

tyr rankfiles 101 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:22706] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
sunpc0
sunpc1
[sunpc1:25189] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
sunpc1
[sunpc1:25189] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[sunpc1:25189] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
sunpc1

tyr rankfiles 102 mpiexec -report-bindings -rf rf_ex_sunpc_linpc hostname
[linpc1:06777] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc1
sunpc1
[sunpc1:25226] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
sunpc1
[sunpc1:25226] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[sunpc1:25226] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
sunpc1
tyr rankfiles 103 
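
The rankfiles themselves are not included in the mail, but a rankfile consistent with the 1.6.4 bindings reported above would look roughly like the following (a reconstruction from the --report-bindings output, not the actual rf_ex_linpc).

rank 0=linpc0 slot=0:0-1,1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1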


Kind regards

Siegmar



Re: [OMPI devel] RFC: Remove (broken) heterogeneous support

2013-01-30 Thread Siegmar Gross
Hi

> WHAT: Remove the configure command line option to enable heterogeneous
> support
> 
> WHY: The heterogeneous conversion code isn't working, very few people
> use this feature
> 
> WHERE: README and config/opal_configure_options.m4.  See attached patch.
> 
> TIMEOUT: Next Tuesday teleconf, 5 Feb, 2013
> 
> MORE DETAIL:
> 
> The heterogeneous code has been broken for a while.  The assumption
> is that this is a minor bug that can fairly easily be fixed, but a)
> no one has taken the time to do so, b) very few people use this
> functionality, and c) many OMPI developers don't even have hardware
> where to test this scenario (e.g., big and little endian systems).
> 
> As such, a suggestion was made to remove the --enable-heterogeneous
> configure CLI switch so that users don't try to enable it.  If
> someone ever fixes the heterogeneous code, the configure CLI switch
> can be put back.

I have no problem with the --enable-heterogeneous option when I build
Open MPI, but Open MPI does not work in a heterogeneous environment
with little and big endian machines, while LAM MPI can handle such
environments. You wanted to solve this problem.

https://svn.open-mpi.org/trac/ompi/ticket/3430

I would appreciate it if you wouldn't remove this option.
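
For context on what heterogeneous support has to do: the same integer has a different byte layout on the little endian (x86/Linux) and big endian (SPARC/Solaris) machines in such a cluster, so message data must be converted somewhere. The small sketch below just makes that layout difference visible; it is an illustration only and has nothing to do with the actual OMPI datatype engine.

/* Show why mixed little/big endian nodes need data conversion:
 * the same uint32_t has a different in-memory byte order. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t value = 0x01020304u;
    unsigned char bytes[sizeof(value)];

    memcpy(bytes, &value, sizeof(value));

    /* x86 (little endian) prints 04 03 02 01, SPARC (big endian) prints
     * 01 02 03 04 -- sending the raw bytes between the two without
     * conversion therefore corrupts the value. */
    for (size_t i = 0; i < sizeof(value); i++) {
        printf("%02x ", bytes[i]);
    }
    printf("\n");
    return 0;
}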


Kind regards

Siegmar