Sigh - yet another corner case. Lovely. Will take a poke at it later this week. Thx for tracking it down
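[Editorial note: the slot list 0:0-5,1:0-5 discussed below names cores 0-5 on each of two sockets, i.e. 12 slots in total, which is why the suggestion below is to treat a bare "--host motomachi" as "motomachi:12". The following is a minimal sketch of that counting only; count_slots() is a hypothetical helper, not anything in the ORTE code base, and it handles only the simple "socket:first-last" entries used in this thread.]

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Count the slots implied by a slot list such as "0:0-5,1:0-5".
     * Hypothetical helper: handles only "<socket>:<first>-<last>" and
     * "<socket>:<core>" entries, not the full Open MPI slot-list grammar. */
    static int count_slots(const char *slot_list)
    {
        int total = 0;
        char *copy = strdup(slot_list);
        char *saveptr = NULL;

        for (char *entry = strtok_r(copy, ",", &saveptr); entry != NULL;
             entry = strtok_r(NULL, ",", &saveptr)) {
            int socket, first, last;
            if (sscanf(entry, "%d:%d-%d", &socket, &first, &last) == 3) {
                total += last - first + 1;   /* e.g. 0-5 contributes 6 slots */
            } else if (sscanf(entry, "%d:%d", &socket, &first) == 2) {
                total += 1;                  /* a single core on that socket */
            }
        }
        free(copy);
        return total;
    }

    int main(void)
    {
        /* 0:0-5,1:0-5 -> 6 + 6 = 12, hence the proposed --host motomachi:12 */
        printf("%d\n", count_slots("0:0-5,1:0-5"));
        return 0;
    }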
> On Jan 11, 2017, at 5:27 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> so it seems the root cause is a kind of incompatibility between the --host
> and the --slot-list options.
>
> On a single node with two six-core sockets, this works:
>
> mpirun -np 1 ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>
> This does not work:
>
> mpirun -np 1 --host motomachi ./spawn_master
>   # not enough slots available, aborts with a user-friendly error message
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master
>   # various errors: sm_segment_attach() fails, a task crashes,
>   # and this ends up with the following error message
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[15519,2],0]) is on host: motomachi
>   Process 2 ([[15519,2],1]) is on host: unknown!
>   BTLs attempted: self tcp
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master   # same error as above
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master   # same error as above
>
> For the record, the following command surprisingly works:
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master
>
> Bottom line, my guess is that when the user specifies both the --slot-list and the
> --host options *and* no slot count is given for the hosts, we should default to the
> number of slots from the slot list (e.g. in this case, default to --host motomachi:12
> instead of, I guess, --host motomachi:1).
>
> /* fwiw, I made https://github.com/open-mpi/ompi/pull/2715, but these changes are not
> the root cause */
>
> Cheers,
>
> Gilles
>
>
> -------- Forwarded Message --------
> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
> Date: Wed, 11 Jan 2017 20:39:02 +0900
> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
> Reply-To: Open MPI Users <us...@lists.open-mpi.org>
> To: Open MPI Users <us...@lists.open-mpi.org>
>
> Siegmar,
>
> Your slot list is correct.
> An invalid slot list for your node would be 0:1-7,1:0-7.
>
> /* and since the test requires only 5 tasks, that could even work with such an
> invalid list. My vm is single socket with 4 cores, so a 0:0-4 slot list results
> in an unfriendly pmix error */
>
> Bottom line, your test is correct, and there is a bug in v2.0.x that I will
> investigate starting tomorrow.
>
> Cheers,
>
> Gilles
>
> On Wednesday, January 11, 2017, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> Hi Gilles,
>
> thank you very much for your help. What does an incorrect slot list mean?
> My machine has two 6-core processors, so I specified "--slot-list 0:0-5,1:0-5".
> Does "incorrect" mean that it isn't allowed to specify more slots than available,
> to specify fewer slots than available, or to specify more slots than needed for
> the processes?
>
>
> Kind regards
>
> Siegmar
>
> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
> Siegmar,
>
> I was able to reproduce the issue on my vm
> (no need for a real heterogeneous cluster here).
>
> I will keep digging tomorrow.
> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a
> very unfriendly error message.
> Right now, the 4th spawned task crashes, so this is a different issue.
>
> Cheers,
>
> Gilles
>
> r...@open-mpi.org wrote:
> I think there is some relevant discussion here:
> https://github.com/open-mpi/ompi/issues/1569
>
> It looks like Gilles had (at least at one point) a fix for master when
> --enable-heterogeneous is used, but I don't know if that was committed.
>
> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Siegmar,
>
> You have some config parameters I wasn't trying that may have some impact.
> I'll give it a try with these parameters.
>
> This should be enough info for now.
>
> Thanks,
>
> Howard
>
>
> 2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>
> Hi Howard,
>
> I use the following commands to build and install the package.
> ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my Linux machine.
>
> mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>
> ../openmpi-2.0.2rc3/configure \
> --prefix=/usr/local/openmpi-2.0.2_64_cc \
> --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
> --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
> --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
> JAVA_HOME=/usr/local/jdk1.8.0_66 \
> LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
> CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
> CPP="cpp" CXXCPP="cpp" \
> --enable-mpi-cxx \
> --enable-mpi-cxx-bindings \
> --enable-cxx-exceptions \
> --enable-mpi-java \
> --enable-heterogeneous \
> --enable-mpi-thread-multiple \
> --with-hwloc=internal \
> --without-verbs \
> --with-wrapper-cflags="-m64 -mt" \
> --with-wrapper-cxxflags="-m64" \
> --with-wrapper-fcflags="-m64" \
> --with-wrapper-ldflags="-mt" \
> --enable-debug \
> |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> rm -r /usr/local/openmpi-2.0.2_64_cc.old
> mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
>
> I get a different error if I run the program with gdb.
>
> loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-suse-linux".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://bugs.opensuse.org/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
> (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
> Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
> Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [New Thread 0x7ffff3b97700 (LWP 13582)]
> [New Thread 0x7ffff18a4700 (LWP 13583)]
> [New Thread 0x7ffff10a3700 (LWP 13584)]
> [New Thread 0x7fffebbba700 (LWP 13585)]
> Detaching after fork from child process 13586.
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Detaching after fork from child process 13589.
> Detaching after fork from child process 13590.
> Detaching after fork from child process 13591.
> [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
> [loki:13586] *** An error occurred in MPI_Comm_spawn
> [loki:13586] *** reported by process [2873294849,0]
> [loki:13586] *** on communicator MPI_COMM_WORLD
> [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
> [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [loki:13586] ***    and potentially your MPI job)
> [Thread 0x7fffebbba700 (LWP 13585) exited]
> [Thread 0x7ffff10a3700 (LWP 13584) exited]
> [Thread 0x7ffff18a4700 (LWP 13583) exited]
> [Thread 0x7ffff3b97700 (LWP 13582) exited]
> [Inferior 1 (process 13567) exited with code 016]
> Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
> (gdb) bt
> No stack.
> (gdb)
>
> Do you need anything else?
>
>
> Kind regards
>
> Siegmar
>
> On 08.01.2017 at 17:02, Howard Pritchard wrote:
>
> Hi Siegmar,
>
> Could you post the configury options you use when building the 2.0.2rc3?
> Maybe that will help in trying to reproduce the segfault you are observing.
>
> Howard
>
>
> 2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>
> Hi,
>
> I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise Server 12 (x86_64)"
> with Sun C 5.14 and gcc-6.3.0. Unfortunately, I still get the same error that I
> reported for rc2.
>
> I would be grateful if somebody could fix the problem before releasing the final
> version. Thank you very much for any help in advance.
>
> Kind regards
>
> Siegmar
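[Editorial note: the spawn_master source itself is not included in this thread. Judging from the gdb output above ("Parent process 0 running on loki / I create 4 slave processes"), the parent spawns four workers via MPI_Comm_spawn, so the job needs five slots in total. Below is a minimal sketch along those lines, not Siegmar's actual test; "./spawn_slave" is a placeholder name for the worker executable.]

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI_Comm_spawn parent in the spirit of the failing test:
     * one master task spawns 4 workers, so the job needs 5 slots in total. */
    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        int rank, errcodes[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("Parent process %d\n", rank);
        printf("I create 4 slave processes\n");

        /* Spawn 4 workers; errcodes receives one status per spawned task. */
        MPI_Comm_spawn("./spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }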
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel