Looking at this note again: how many procs is spawn_master generating?

> On Jan 11, 2017, at 7:39 PM, r...@open-mpi.org wrote:
>
> Sigh - yet another corner case. Lovely. Will take a poke at it later this week. Thx for tracking it down
>
>> On Jan 11, 2017, at 5:27 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Ralph,
>>
>> so it seems the root cause is a kind of incompatibility between the --host and the --slot-list options.
>>
>> On a single node with two six-core sockets, this works:
>>
>> mpirun -np 1 ./spawn_master
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
>> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>>
>> This does not work:
>>
>> mpirun -np 1 --host motomachi ./spawn_master
>>   # not enough slots available; aborts with a user-friendly error message
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master
>>   # various errors: sm_segment_attach() fails, a task crashes, and it ends up with the following error message:
>>
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[15519,2],0]) is on host: motomachi
>> Process 2 ([[15519,2],1]) is on host: unknown!
>> BTLs attempted: self tcp
>>
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master
>>   # same error as above
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master
>>   # same error as above
>>
>> For the record, the following command surprisingly works:
>>
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master
>>
>> Bottom line, my guess is that when the user specifies both the --slot-list and the --host options *and* no slot count is given for the host, we should default to the number of slots from the slot list (e.g. in this case, default to --host motomachi:12 instead of, I guess, --host motomachi:1).
>>
>> /* fwiw, I made
>> https://github.com/open-mpi/ompi/pull/2715
>> https://github.com/open-mpi/ompi/pull/2715
>> but these are not the root cause */
>>
>> Cheers,
>>
>> Gilles
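Gilles' "bottom line" above amounts to deriving the host's slot count from the --slot-list argument whenever --host names a machine without an explicit slot count. Purely as an illustration of that counting step (this is not ORTE code, and the socket:core-range parsing is an assumption based on the 0:0-5,1:0-5 example above), a sketch in C:

    #include <stdio.h>
    #include <string.h>

    /* Illustration only: count the cores named by a slot list such as
     * "0:0-5,1:0-5" (socket:core-range, comma separated).  The point is
     * that this particular list names 12 cores. */
    static int count_slots(const char *slot_list)
    {
        char copy[128];
        int total = 0;

        strncpy(copy, slot_list, sizeof(copy) - 1);
        copy[sizeof(copy) - 1] = '\0';

        for (char *tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ",")) {
            const char *cores = strchr(tok, ':');
            cores = (cores != NULL) ? cores + 1 : tok;   /* skip a "socket:" prefix */
            int lo, hi;
            if (sscanf(cores, "%d-%d", &lo, &hi) == 2) {
                total += hi - lo + 1;                    /* a range like 0-5 names 6 cores */
            } else {
                total += 1;                              /* a single core id */
            }
        }
        return total;
    }

    int main(void)
    {
        printf("%d\n", count_slots("0:0-5,1:0-5"));      /* prints 12 */
        return 0;
    }

Under that reading, --slot-list 0:0-5,1:0-5 --host motomachi would behave like --host motomachi:12, which is the invocation shown above that does work.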
>> -------- Forwarded Message --------
>> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
>> Date: Wed, 11 Jan 2017 20:39:02 +0900
>> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
>> Reply-To: Open MPI Users <us...@lists.open-mpi.org>
>> To: Open MPI Users <us...@lists.open-mpi.org>
>>
>> Siegmar,
>>
>> Your slot list is correct. An invalid slot list for your node would be 0:1-7,1:0-7.
>>
>> /* and since the test requires only 5 tasks, that could even work with such an invalid list. My vm is single socket with 4 cores, so a 0:0-4 slot list results in an unfriendly pmix error */
>>
>> Bottom line, your test is correct, and there is a bug in v2.0.x that I will start investigating tomorrow.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>> Hi Gilles,
>>
>> thank you very much for your help. What does an incorrect slot list mean? My machine has two 6-core processors, so I specified "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't allowed to specify more slots than available, to specify fewer slots than available, or to specify more slots than needed for the processes?
>>
>> Kind regards
>>
>> Siegmar
>>
>> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>>
>> Siegmar,
>>
>> I was able to reproduce the issue on my vm (no need for a real heterogeneous cluster here).
>>
>> I will keep digging tomorrow. Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a very unfriendly error message. Right now, the 4th spawned task crashes, so this is a different issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> r...@open-mpi.org wrote:
>>
>> I think there is some relevant discussion here: https://github.com/open-mpi/ompi/issues/1569
>>
>> It looks like Gilles had (at least at one point) a fix for master when building with --enable-heterogeneous, but I don't know if that was committed.
>>
>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>
>> HI Siegmar,
>>
>> You have some config parameters I wasn't trying that may have some impact. I'll give it a try with these parameters.
>>
>> This should be enough info for now,
>>
>> Thanks,
>>
>> Howard
>>
>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>
>> Hi Howard,
>>
>> I use the following commands to build and install the package. ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my Linux machine.
>>
>> mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>
>> ../openmpi-2.0.2rc3/configure \
>>   --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>   --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>   CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>   CPP="cpp" CXXCPP="cpp" \
>>   --enable-mpi-cxx \
>>   --enable-mpi-cxx-bindings \
>>   --enable-cxx-exceptions \
>>   --enable-mpi-java \
>>   --enable-heterogeneous \
>>   --enable-mpi-thread-multiple \
>>   --with-hwloc=internal \
>>   --without-verbs \
>>   --with-wrapper-cflags="-m64 -mt" \
>>   --with-wrapper-cxxflags="-m64" \
>>   --with-wrapper-fcflags="-m64" \
>>   --with-wrapper-ldflags="-mt" \
>>   --enable-debug \
>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> rm -r /usr/local/openmpi-2.0.2_64_cc.old
>> mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>
>> I get a different error if I run the program with gdb.
>>
>> loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-suse-linux".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://bugs.opensuse.org/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>> (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>> Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>> Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib64/libthread_db.so.1".
>> [New Thread 0x7ffff3b97700 (LWP 13582)]
>> [New Thread 0x7ffff18a4700 (LWP 13583)]
>> [New Thread 0x7ffff10a3700 (LWP 13584)]
>> [New Thread 0x7fffebbba700 (LWP 13585)]
>> Detaching after fork from child process 13586.
>>
>> Parent process 0 running on loki
>> I create 4 slave processes
>>
>> Detaching after fork from child process 13589.
>> Detaching after fork from child process 13590.
>> Detaching after fork from child process 13591.
>> [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
>> [loki:13586] *** An error occurred in MPI_Comm_spawn
>> [loki:13586] *** reported by process [2873294849,0]
>> [loki:13586] *** on communicator MPI_COMM_WORLD
>> [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>> [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [loki:13586] ***    and potentially your MPI job)
>> [Thread 0x7fffebbba700 (LWP 13585) exited]
>> [Thread 0x7ffff10a3700 (LWP 13584) exited]
>> [Thread 0x7ffff18a4700 (LWP 13583) exited]
>> [Thread 0x7ffff3b97700 (LWP 13582) exited]
>> [Inferior 1 (process 13567) exited with code 016]
>> Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
>> (gdb) bt
>> No stack.
>> (gdb)
>>
>> Do you need anything else?
>>
>> Kind regards
>>
>> Siegmar
>>
>> On 08.01.2017 at 17:02, Howard Pritchard wrote:
>>
>> HI Siegmar,
>>
>> Could you post the configury options you use when building the 2.0.2rc3? Maybe that will help in trying to reproduce the segfault you are observing.
>>
>> Howard
>>
>> 2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>
>> Hi,
>>
>> I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately, I still get the same error that I reported for rc2.
>>
>> I would be grateful if somebody could fix the problem before releasing the final version. Thank you very much for any help in advance.
>>
>> Kind regards
>>
>> Siegmar
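For reference, the spawn_master source itself is not quoted anywhere in this thread. A minimal stand-in that exercises the same MPI_Comm_spawn path is sketched below; the child count of 4 and the slave executable name "spawn_slave" are assumptions taken from the "I create 4 slave processes" output above and from Gilles' remark that the test needs only 5 tasks.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        printf("Parent process %d running on %s\n", rank, host);
        printf("I create 4 slave processes\n");

        /* 1 parent + 4 spawned children = the 5 tasks mentioned in the thread.
         * "spawn_slave" is a hypothetical child binary, not part of this thread. */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Comm_free(&intercomm);
        MPI_Finalize();
        return 0;
    }

Run the way the gdb session above runs it (mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master), the MPI_Comm_spawn call would be where the OPAL timeout and the MPI_ERR_UNKNOWN abort are reported.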
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel