On Jan 12, 2017, at 8:57 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
Siegmar,
Could you confirm that your test case passes if you use one of the mpirun
arg lists that works for Gilles? Something simple like
mpirun -np 1 ./spawn_master
?
Howard
2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet <gil...@rist.or.jp>:
Ralph,
So it seems the root cause is an incompatibility between the --host
and --slot-list options.
On a single node with two six-core sockets, the following commands work:
mpirun -np 1 ./spawn_master
mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
The following commands do not work:
mpirun -np 1 --host motomachi ./spawn_master # not enough slots available,
aborts with a user-friendly error message
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master #
various errors: sm_segment_attach() fails, a task crashes,
and it ends up with the following error message:
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[15519,2],0]) is on host: motomachi
Process 2 ([[15519,2],1]) is on host: unknown!
BTLs attempted: self tcp
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master #
same error as above
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master #
same error as above
For the record, the following command surprisingly works:
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self
./spawn_master
Bottom line: my guess is that when the user specifies both the --slot-list
and the --host options *and* no slot counts are given for the hosts, we
should default to the number of slots from the slot list
(e.g. in this case, default to --host motomachi:12 instead of, I guess,
--host motomachi:1).
/* fwiw, I made
https://github.com/open-mpi/ompi/pull/2715
but these are not the root cause */
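For context, Siegmar's spawn_master source was not posted in this thread, so the following is only a hedged sketch of what such a test likely looks like. It assumes the master spawns 4 workers from a hypothetical companion spawn_slave binary, matching the "I create 4 slave processes" output quoted later in the thread; the real test may differ.

```c
/* spawn_master.c -- hypothetical minimal reproducer, not the actual
 * test from this thread.
 * Build: mpicc spawn_master.c -o spawn_master
 * Run:   mpirun -np 1 ./spawn_master
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("Parent process %d running on %s\n", rank, host);
    printf("I create 4 slave processes\n");

    /* "spawn_slave" is an assumed name for the worker binary; the
     * count of 4 matches the output quoted later in the thread.
     * This is the call that times out / aborts in the failing runs. */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

The key point for this bug report is that the spawned children, not the parent, are the processes placed according to --slot-list/--host, which is why the failures show up inside MPI_Comm_spawn.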
Cheers,
Gilles
-------- Forwarded Message --------
Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Date: Wed, 11 Jan 2017 20:39:02 +0900
From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Reply-To: Open MPI Users <us...@lists.open-mpi.org>
To: Open MPI Users <us...@lists.open-mpi.org>
Siegmar,
Your slot list is correct.
An invalid slot list for your node would be 0:1-7,1:0-7
/* and since the test requires only 5 tasks, it could even work with such
an invalid list.
My vm is single socket with 4 cores, so a 0:0-4 slot list results in an
unfriendly pmix error */
Bottom line, your test is correct, and there is a bug in v2.0.x that I will
investigate starting tomorrow.
Cheers,
Gilles
On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi Gilles,
thank you very much for your help. What does an "incorrect slot list"
mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't allowed
to specify more slots than available, fewer slots than available, or
more slots than needed for the processes?
Kind regards
Siegmar
On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
Siegmar,
I was able to reproduce the issue on my vm
(No need for a real heterogeneous cluster here)
I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn
fails with a very unfriendly error message.
Right now, the 4th spawned task crashes, so this is a different
issue.
Cheers,
Gilles
r...@open-mpi.org wrote:
I think there is some relevant discussion here:
https://github.com/open-mpi/ompi/issues/1569
It looks like Gilles had (at least at one point) a fix for master
when building with --enable-heterogeneous, but I don't know if that was committed.
On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
HI Siegmar,
You have some config parameters I wasn't trying that may have
some impact.
I'll give it a try with these parameters.
This should be enough info for now,
Thanks,
Howard
2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
Hi Howard,
I use the following commands to build and install the package.
${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
Linux machine.
mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
../openmpi-2.0.2rc3/configure \
--prefix=/usr/local/openmpi-2.0.2_64_cc \
--libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC"
FC="f95" \
CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
CPP="cpp" CXXCPP="cpp" \
--enable-mpi-cxx \
--enable-mpi-cxx-bindings \
--enable-cxx-exceptions \
--enable-mpi-java \
--enable-heterogeneous \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.0.2_64_cc.old
mv /usr/local/openmpi-2.0.2_64_cc
/usr/local/openmpi-2.0.2_64_cc.old
make install |& tee
log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee
log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
I get a different error if I run the program with gdb.
loki spawn 118 gdb
/usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and
redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type
"show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources
online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to
"word"...
Reading symbols from
/usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
(gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5
spawn_master
Starting program:
/usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list
0:0-5,1:0-5 spawn_master
Missing separate debuginfos, use: zypper install
glibc-debuginfo-2.24-2.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff3b97700 (LWP 13582)]
[New Thread 0x7ffff18a4700 (LWP 13583)]
[New Thread 0x7ffff10a3700 (LWP 13584)]
[New Thread 0x7fffebbba700 (LWP 13585)]
Detaching after fork from child process 13586.
Parent process 0 running on loki
I create 4 slave processes
Detaching after fork from child process 13589.
Detaching after fork from child process 13590.
Detaching after fork from child process 13591.
[loki:13586] OPAL ERROR: Timeout in file
../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
[loki:13586] *** An error occurred in MPI_Comm_spawn
[loki:13586] *** reported by process [2873294849,0]
[loki:13586] *** on communicator MPI_COMM_WORLD
[loki:13586] *** MPI_ERR_UNKNOWN: unknown error
[loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[loki:13586] *** and potentially your MPI job)
[Thread 0x7fffebbba700 (LWP 13585) exited]
[Thread 0x7ffff10a3700 (LWP 13584) exited]
[Thread 0x7ffff18a4700 (LWP 13583) exited]
[Thread 0x7ffff3b97700 (LWP 13582) exited]
[Inferior 1 (process 13567) exited with code 016]
Missing separate debuginfos, use: zypper install
libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
(gdb) bt
No stack.
(gdb)
Do you need anything else?
Kind regards
Siegmar
On 08.01.2017 at 17:02, Howard Pritchard wrote:
HI Siegmar,
Could you post the configury options you use when
building the 2.0.2rc3?
Maybe that will help in trying to reproduce the
segfault you are observing.
Howard
2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
Hi,
I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
I still get the same error that I reported for rc2.
I would be grateful if somebody could fix the problem before
releasing the final version. Thank you very much for any help
in advance.
Kind regards
Siegmar
_______________________________________________
users mailing list
us...@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel