Looking at this note again: how many procs is spawn_master generating?
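
(For reference, a minimal sketch of what a spawn_master-style test could look
like; this is a reconstruction, not Siegmar's actual source, and the child
binary name "spawn_slave" is assumed. The gdb output further down prints
"I create 4 slave processes", so the parent appears to spawn 4 children via
MPI_Comm_spawn.)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        printf("Parent process %d running on %s\n", rank, host);
        printf("  I create 4 slave processes\n");

        /* spawn 4 children; they connect back to the parent through
         * the intercommunicator */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }

If so, an "mpirun -np 1" run needs 5 slots in total (1 master plus 4 spawned
children), which matches Gilles' note below that the test requires 5 tasks.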

> On Jan 11, 2017, at 7:39 PM, r...@open-mpi.org wrote:
> 
> Sigh - yet another corner case. Lovely. Will take a poke at it later this 
> week. Thx for tracking it down
> 
>> On Jan 11, 2017, at 5:27 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> 
>> Ralph,
>> 
>> 
>> 
>> so it seems the root cause is an incompatibility between the --host
>> and the --slot-list options
>> 
>> on a single node with two six-core sockets,
>> this works:
>> 
>> mpirun -np 1 ./spawn_master 
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
>> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master 
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master 
>> 
>> 
>> this does not work:
>> 
>> mpirun -np 1 --host motomachi ./spawn_master # not enough slots available,
>> aborts with a user-friendly error message
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master #
>> various errors: sm_segment_attach() fails, a task crashes,
>> and it ends with the following error message:
>> 
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[15519,2],0]) is on host: motomachi
>>   Process 2 ([[15519,2],1]) is on host: unknown!
>>   BTLs attempted: self tcp
>> 
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master # 
>> same error as above
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master # 
>> same error as above
>> 
>> 
>> for the record, the following command surprisingly works:
>> 
>> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self 
>> ./spawn_master
>> 
>> 
>> 
>> bottom line, my guess is that when the user specifies both the --slot-list
>> and the --host options
>> *and* no slot count is given for the hosts, we should default to using
>> the number of slots from the slot list
>> (e.g. in this case, default to --host motomachi:12 instead of, i guess,
>> --host motomachi:1)
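>> 
>> (A hypothetical illustration of that default, not the actual ORTE code: the
>> count_slots() helper below just counts the cores named in a slot-list string,
>> which is the number I would use as the implicit per-host slot count; the
>> parsing is deliberately simplified.)
>> 
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <string.h>
>> 
>>     /* count the cores in a slot list such as "0:0-5,1:0-5";
>>      * each comma-separated entry is <socket>:<core or core range> */
>>     static int count_slots(const char *slot_list)
>>     {
>>         char *copy = strdup(slot_list);
>>         int total = 0;
>>         for (char *entry = strtok(copy, ","); entry != NULL;
>>              entry = strtok(NULL, ",")) {
>>             const char *cores = strchr(entry, ':');
>>             cores = (cores != NULL) ? cores + 1 : entry;
>>             int first, last;
>>             if (sscanf(cores, "%d-%d", &first, &last) == 2) {
>>                 total += last - first + 1;   /* a range, e.g. "0-5" */
>>             } else {
>>                 total += 1;                  /* a single core, e.g. "3" */
>>             }
>>         }
>>         free(copy);
>>         return total;
>>     }
>> 
>>     int main(void)
>>     {
>>         /* "0:0-5,1:0-5" -> 12, i.e. the implied --host motomachi:12 */
>>         printf("%d\n", count_slots("0:0-5,1:0-5"));
>>         return 0;
>>     }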
>> 
>> 
>> /* fwiw, i made
>> https://github.com/open-mpi/ompi/pull/2715
>> but this is not the root cause */
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> -------- Forwarded Message --------
>> Subject:     Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3
>> on Linux
>> Date:        Wed, 11 Jan 2017 20:39:02 +0900
>> From:        Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
>> Reply-To:    Open MPI Users <us...@lists.open-mpi.org>
>> To:          Open MPI Users <us...@lists.open-mpi.org>
>> 
>> Siegmar,
>> 
>> Your slot list is correct.
>> An invalid slot list for your node would be 0:1-7,1:0-7.
>> 
>> /* and since the test requires only 5 tasks, it could even work with such
>> an invalid list.
>> My VM has a single socket with 4 cores, so a 0:0-4 slot list results in an
>> unfriendly PMIx error */
>> 
>> Bottom line, your test is correct, and there is a bug in v2.0.x that I will
>> investigate starting tomorrow.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wednesday, January 11, 2017, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> Hi Gilles,
>> 
>> thank you very much for your help. What does an incorrect slot list
>> mean? My machine has two 6-core processors, so I specified
>> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
>> allowed to specify more slots than available, to specify fewer
>> slots than available, or to specify more slots than needed for
>> the processes?
>> 
>> 
>> Kind regards
>> 
>> Siegmar
>> 
>> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>> Siegmar,
>> 
>> I was able to reproduce the issue on my VM
>> (no need for a real heterogeneous cluster here).
>> 
>> I will keep digging tomorrow.
>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a
>> very unfriendly error message.
>> Right now, the 4th spawned task crashes, so this is a different issue.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> r...@open-mpi.org wrote:
>> I think there is some relevant discussion here:
>> https://github.com/open-mpi/ompi/issues/1569
>> 
>> It looks like Gilles had (at least at one point) a fix for master when
>> --enable-heterogeneous is used, but I don't know if that was committed.
>> 
>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> 
>> Hi Siegmar,
>> 
>> You have some config parameters I wasn't trying that may have some impact.
>> I'll give it a try with these parameters.
>> 
>> This should be enough info for now,
>> 
>> Thanks,
>> 
>> Howard
>> 
>> 
>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de>:
>> 
>>     Hi Howard,
>> 
>>     I use the following commands to build and install the package.
>>     ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
>>     Linux machine.
>> 
>>     mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>     cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>> 
>>     ../openmpi-2.0.2rc3/configure \
>>       --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>       --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>       --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>       --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>       JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>       LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>       CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>       CPP="cpp" CXXCPP="cpp" \
>>       --enable-mpi-cxx \
>>       --enable-mpi-cxx-bindings \
>>       --enable-cxx-exceptions \
>>       --enable-mpi-java \
>>       --enable-heterogeneous \
>>       --enable-mpi-thread-multiple \
>>       --with-hwloc=internal \
>>       --without-verbs \
>>       --with-wrapper-cflags="-m64 -mt" \
>>       --with-wrapper-cxxflags="-m64" \
>>       --with-wrapper-fcflags="-m64" \
>>       --with-wrapper-ldflags="-mt" \
>>       --enable-debug \
>>       |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> 
>>     make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>     rm -r /usr/local/openmpi-2.0.2_64_cc.old
>>     mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>>     make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>     make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>> 
>> 
>>     I get a different error if I run the program with gdb.
>> 
>>     loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>>     GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>>     Copyright (C) 2016 Free Software Foundation, Inc.
>>     License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>     This is free software: you are free to change and redistribute it.
>>     There is NO WARRANTY, to the extent permitted by law.  Type "show 
>> copying"
>>     and "show warranty" for details.
>>     This GDB was configured as "x86_64-suse-linux".
>>     Type "show configuration" for configuration details.
>>     For bug reporting instructions, please see:
>>     <http://bugs.opensuse.org/>.
>>     Find the GDB manual and other documentation resources online at:
>>     <http://www.gnu.org/software/gdb/documentation/>.
>>     For help, type "help".
>>     Type "apropos word" to search for commands related to "word"...
>>     Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>>     (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>     Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 
>> --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>     Missing separate debuginfos, use: zypper install 
>> glibc-debuginfo-2.24-2.3.x86_64
>>     [Thread debugging using libthread_db enabled]
>>     Using host libthread_db library "/lib64/libthread_db.so.1".
>>     [New Thread 0x7ffff3b97700 (LWP 13582)]
>>     [New Thread 0x7ffff18a4700 (LWP 13583)]
>>     [New Thread 0x7ffff10a3700 (LWP 13584)]
>>     [New Thread 0x7fffebbba700 (LWP 13585)]
>>     Detaching after fork from child process 13586.
>> 
>>     Parent process 0 running on loki
>>       I create 4 slave processes
>> 
>>     Detaching after fork from child process 13589.
>>     Detaching after fork from child process 13590.
>>     Detaching after fork from child process 13591.
>>     [loki:13586] OPAL ERROR: Timeout in file 
>> ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
>>     [loki:13586] *** An error occurred in MPI_Comm_spawn
>>     [loki:13586] *** reported by process [2873294849,0]
>>     [loki:13586] *** on communicator MPI_COMM_WORLD
>>     [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>>     [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>> will now abort,
>>     [loki:13586] ***    and potentially your MPI job)
>>     [Thread 0x7fffebbba700 (LWP 13585) exited]
>>     [Thread 0x7ffff10a3700 (LWP 13584) exited]
>>     [Thread 0x7ffff18a4700 (LWP 13583) exited]
>>     [Thread 0x7ffff3b97700 (LWP 13582) exited]
>>     [Inferior 1 (process 13567) exited with code 016]
>>     Missing separate debuginfos, use: zypper install 
>> libpciaccess0-debuginfo-0.13.2-5.1.x86_64 
>> libudev1-debuginfo-210-116.3.3.x86_64
>>     (gdb) bt
>>     No stack.
>>     (gdb)
>> 
>>     Do you need anything else?
>> 
>> 
>>     Kind regards
>> 
>>     Siegmar
>> 
>>     On 08.01.2017 at 17:02, Howard Pritchard wrote:
>> 
>>         Hi Siegmar,
>> 
>>         Could you post the configury options you use when building
>> 2.0.2rc3?
>>         Maybe that will help in trying to reproduce the segfault you are 
>> observing.
>> 
>>         Howard
>> 
>> 
>>         2017-01-07 2:30 GMT-07:00 Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de>:
>> 
>>             Hi,
>> 
>>             I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
>>             Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
>>             I still get the same error that I reported for rc2.
>> 
>>             I would be grateful if somebody could fix the problem before
>>             releasing the final version. Thank you very much for any help
>>             in advance.
>> 
>> 
>>             Kind regards
>> 
>>             Siegmar

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
