Thanks Siegmar. I just wanted to confirm you weren't having some other issue besides the host and slot-list problems.
Howard

2017-01-12 23:50 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard and Gilles,
>
> thank you very much for your help. All commands that work for
> Gilles also work on my machine as expected, and the commands that
> don't work on his machine don't work on mine either. The first
> command that works with both --slot-list and --host is the one
> below, so it seems that the required slot value depends on the
> number of processes in the remote group.
>
> loki spawn 122 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:3 spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 3
>
> Slave process 0 of 3 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> Slave process 1 of 3 running on loki
> spawn_slave 1: argv[0]: spawn_slave
> Slave process 2 of 3 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> loki spawn 123
>
>
> Here is the output from the other commands.
>
> loki spawn 112 mpirun -np 1 spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> Slave process 3 of 4 running on loki
> Slave process 0 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
>
> loki spawn 113 mpirun -np 1 --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 114 mpirun -np 1 --host loki --oversubscribe spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 115 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:12 spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 2 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 3 of 4 running on loki
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> spawn_slave 3: argv[0]: spawn_slave
>
> loki spawn 116 mpirun -np 1 --host loki:12 --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
> I create 4 slave processes
>
> Slave process 0 of 4 running on loki
> Slave process 1 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> Slave process 3 of 4 running on loki
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> Parent process 0: tasks in MPI_COMM_WORLD: 1
>   tasks in COMM_CHILD_PROCESSES local group: 1
>   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> loki spawn 117
>
>
> Kind regards
>
> Siegmar
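(The spawn_master source is not included in this thread. A minimal C sketch consistent with the output Siegmar quotes might look like the code below; the slave binary name, the spawn arguments, and the output format are assumptions. Note that the loki:3 run reports a remote group of 3 despite "I create 4 slave processes", which suggests the real test may pass a "soft" spawn hint that this sketch omits.)

    /* Minimal sketch of a spawn_master-style test, consistent with the
     * output quoted above. The real spawn_master.c is not part of this
     * thread; names and spawn arguments here are assumptions. */
    #include <stdio.h>
    #include <mpi.h>

    #define NUM_SLAVES 4

    int main(int argc, char *argv[])
    {
        int rank, world_size, local_size, remote_size, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm COMM_CHILD_PROCESSES;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        printf("Parent process %d running on %s\n", rank, host);
        printf("I create %d slave processes\n", NUM_SLAVES);

        /* Spawn the slaves; the --slot-list/--host interaction under
         * discussion determines where (and whether) they can start. */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                       &COMM_CHILD_PROCESSES, MPI_ERRCODES_IGNORE);

        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        MPI_Comm_size(COMM_CHILD_PROCESSES, &local_size);
        MPI_Comm_remote_size(COMM_CHILD_PROCESSES, &remote_size);
        printf("Parent process %d: tasks in MPI_COMM_WORLD: %d\n"
               "  tasks in COMM_CHILD_PROCESSES local group: %d\n"
               "  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
               rank, world_size, local_size, remote_size);

        MPI_Comm_disconnect(&COMM_CHILD_PROCESSES);
        MPI_Finalize();
        return 0;
    }

The matching spawn_slave would obtain the parent intercommunicator with MPI_Comm_get_parent(), print its rank and argv[0], and finalize.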
> Am 12.01.2017 um 22:25 schrieb r...@open-mpi.org:
>
>> Fix is pending here: https://github.com/open-mpi/ompi/pull/2730
>>
>> On Jan 12, 2017, at 8:57 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>
>>> Siegmar,
>>>
>>> Could you confirm that if you use one of the mpirun arg lists that
>>> works for Gilles, your test case passes? Something simple like
>>>
>>> mpirun -np 1 ./spawn_master
>>>
>>> ?
>>>
>>> Howard
>>>
>>> 2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>
>>> Ralph,
>>>
>>> so it seems the root cause is an incompatibility between the
>>> --host and the --slot-list options.
>>>
>>> On a single node with two six-core sockets, this works:
>>>
>>> mpirun -np 1 ./spawn_master
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
>>> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>>>
>>> and this does not work:
>>>
>>> mpirun -np 1 --host motomachi ./spawn_master
>>>   # not enough slots available; aborts with a user-friendly error message
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master
>>>   # various errors: sm_segment_attach() fails, a task crashes, and it
>>>   # ends with the following error message:
>>>
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>>   Process 1 ([[15519,2],0]) is on host: motomachi
>>>   Process 2 ([[15519,2],1]) is on host: unknown!
>>>   BTLs attempted: self tcp
>>>
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master
>>>   # same error as above
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master
>>>   # same error as above
>>>
>>> For the record, the following command surprisingly works:
>>>
>>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master
>>>
>>> Bottom line, my guess is that when the user specifies both the
>>> --slot-list and the --host options *and* there is no default number
>>> of slots per host, we should default to the number of slots from the
>>> slot list (e.g. in this case, default to --host motomachi:12 instead
>>> of, I guess, --host motomachi:1).
>>>
>>> /* fwiw, I made https://github.com/open-mpi/ompi/pull/2715
>>> but these are not the root cause */
>>>
>>> Cheers,
>>>
>>> Gilles
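(An aside on the MPI_ERRORS_ARE_FATAL aborts visible in the failing cases and in the gdb session further down: a parent can receive the error instead of aborting by installing MPI_ERRORS_RETURN before the spawn. A hedged sketch, not taken from Siegmar's test:)

    /* Hedged debugging sketch (not from Siegmar's test): let a failed
     * MPI_Comm_spawn return an error code instead of aborting the job. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rc, errcodes[4];
        MPI_Comm child;

        MPI_Init(&argc, &argv);
        /* The default handler on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL;
         * MPI_ERRORS_RETURN hands the error back for inspection. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                            0, MPI_COMM_WORLD, &child, errcodes);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len, i;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
            for (i = 0; i < 4; i++)
                if (errcodes[i] != MPI_SUCCESS)
                    fprintf(stderr, "  slave %d did not start (code %d)\n",
                            i, errcodes[i]);
            MPI_Abort(MPI_COMM_WORLD, rc);
        }

        MPI_Comm_disconnect(&child);
        MPI_Finalize();
        return 0;
    }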
>>> -------- Forwarded Message --------
>>> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
>>> Date: Wed, 11 Jan 2017 20:39:02 +0900
>>> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
>>> Reply-To: Open MPI Users <us...@lists.open-mpi.org>
>>> To: Open MPI Users <us...@lists.open-mpi.org>
>>>
>>> Siegmar,
>>>
>>> Your slot list is correct. An invalid slot list for your node would
>>> be 0:1-7,1:0-7.
>>>
>>> /* and since the test requires only 5 tasks, that could even work
>>> with such an invalid list. My vm is single socket with 4 cores, so a
>>> 0:0-4 slot list results in an unfriendly pmix error */
>>>
>>> Bottom line, your test is correct, and there is a bug in v2.0.x that
>>> I will investigate starting tomorrow.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>
>>> Hi Gilles,
>>>
>>> thank you very much for your help. What does an incorrect slot list
>>> mean? My machine has two 6-core processors, so I specified
>>> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't allowed
>>> to specify more slots than available, to specify fewer slots than
>>> available, or to specify more slots than needed for the processes?
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> Am 11.01.2017 um 10:04 schrieb Gilles Gouaillardet:
>>>
>>> Siegmar,
>>>
>>> I was able to reproduce the issue on my vm (no need for a real
>>> heterogeneous cluster here). I will keep digging tomorrow.
>>> Note that if you specify an incorrect slot list, MPI_Comm_spawn
>>> fails with a very unfriendly error message. Right now, the 4th
>>> spawned task crashes, so this is a different issue.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> r...@open-mpi.org wrote:
>>> I think there is some relevant discussion here:
>>> https://github.com/open-mpi/ompi/issues/1569
>>>
>>> It looks like Gilles had (at least at one point) a fix for master
>>> when building with --enable-heterogeneous, but I don't know if that
>>> was committed.
>>>
>>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>
>>> Hi Siegmar,
>>>
>>> You have some config parameters I wasn't trying that may have some
>>> impact. I'll give it a try with these parameters.
>>>
>>> This should be enough info for now.
>>>
>>> Thanks,
>>>
>>> Howard
>>>
>>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>> Hi Howard,
>>>
>>> I use the following commands to build and install the package.
>>> ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
>>> Linux machine.
>>>
>>> mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>> cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>
>>> ../openmpi-2.0.2rc3/configure \
>>>   --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>>   --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>>   CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>   CPP="cpp" CXXCPP="cpp" \
>>>   --enable-mpi-cxx \
>>>   --enable-mpi-cxx-bindings \
>>>   --enable-cxx-exceptions \
>>>   --enable-mpi-java \
>>>   --enable-heterogeneous \
>>>   --enable-mpi-thread-multiple \
>>>   --with-hwloc=internal \
>>>   --without-verbs \
>>>   --with-wrapper-cflags="-m64 -mt" \
>>>   --with-wrapper-cxxflags="-m64" \
>>>   --with-wrapper-fcflags="-m64" \
>>>   --with-wrapper-ldflags="-mt" \
>>>   --enable-debug \
>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>
>>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>> rm -r /usr/local/openmpi-2.0.2_64_cc.old
>>> mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>
>>> I get a different error if I run the program with gdb.
>>>
>>> loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>>> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-suse-linux".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://bugs.opensuse.org/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>>> (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>> Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>> Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>> [New Thread 0x7ffff3b97700 (LWP 13582)]
>>> [New Thread 0x7ffff18a4700 (LWP 13583)]
>>> [New Thread 0x7ffff10a3700 (LWP 13584)]
>>> [New Thread 0x7fffebbba700 (LWP 13585)]
>>> Detaching after fork from child process 13586.
>>>
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>>
>>> Detaching after fork from child process 13589.
>>> Detaching after fork from child process 13590.
>>> Detaching after fork from child process 13591.
>>> [loki:13586] OPAL ERROR: Timeout in file
>>> ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
>>> [loki:13586] *** An error occurred in MPI_Comm_spawn
>>> [loki:13586] *** reported by process [2873294849,0]
>>> [loki:13586] *** on communicator MPI_COMM_WORLD
>>> [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>>> [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [loki:13586] *** and potentially your MPI job)
>>> [Thread 0x7fffebbba700 (LWP 13585) exited]
>>> [Thread 0x7ffff10a3700 (LWP 13584) exited]
>>> [Thread 0x7ffff18a4700 (LWP 13583) exited]
>>> [Thread 0x7ffff3b97700 (LWP 13582) exited]
>>> [Inferior 1 (process 13567) exited with code 016]
>>> Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
>>> (gdb) bt
>>> No stack.
>>> (gdb)
>>>
>>> Do you need anything else?
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> Am 08.01.2017 um 17:02 schrieb Howard Pritchard:
>>>
>>> Hi Siegmar,
>>>
>>> Could you post the configury options you use when building 2.0.2rc3?
>>> Maybe that will help in trying to reproduce the segfault you are
>>> observing.
>>>
>>> Howard
>>>
>>> 2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>> Hi,
>>>
>>> I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
>>> Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
>>> I still get the same error that I reported for rc2.
>>>
>>> I would be grateful if somebody can fix the problem before
>>> releasing the final version. Thank you very much for any help
>>> in advance.
>>>
>>> Kind regards
>>>
>>> Siegmar