Siegmar,

Could you confirm that your test case passes if you use one of the mpirun
argument lists that works for Gilles? Something simple like:

mpirun -np 1 ./spawn_master

Howard
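
For reference, spawn_master is Siegmar's MPI_Comm_spawn test; per the gdb
output further down, the parent spawns 4 slave processes. A minimal,
self-contained sketch along those lines (not the actual test source; the
self-spawn via argv[0] and the printed messages are illustrative assumptions)
could look like this:

    /* spawn_sketch.c - minimal MPI_Comm_spawn reproducer in the spirit of
     * the spawn_master test. To stay self-contained it spawns copies of
     * itself; the children detect this via MPI_Comm_get_parent. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* parent: spawn 4 slaves, matching the "I create 4 slave
             * processes" output quoted below */
            printf("Parent process %d\n", rank);
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&intercomm);
        } else {
            /* spawned slave */
            printf("Slave process %d\n", rank);
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }

Built with "mpicc spawn_sketch.c -o spawn_master" and launched with the
commands discussed below, it exercises the same code path (dynamic process
management via MPI_Comm_spawn).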




2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet <gil...@rist.or.jp>:

> Ralph,
>
>
> so it seems the root cause is a kind of incompatibility between the --host
> and the --slot-list options
>
>
> on a single node with two six-core sockets,
> this works:
>
> mpirun -np 1 ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>
>
> this does not work:
>
> mpirun -np 1 --host motomachi ./spawn_master # not enough slots available,
> aborts with a user-friendly error message
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master #
> various errors: sm_segment_attach() fails, a task crashes,
> and it ends with the following error message
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[15519,2],0]) is on host: motomachi
>   Process 2 ([[15519,2],1]) is on host: unknown!
>   BTLs attempted: self tcp
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master #
> same error as above
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master #
> same error as above
>
>
> for the record, the following command surprisingly works
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self
> ./spawn_master
>
>
>
> bottom line, my guess is that when the user specifies both the --slot-list
> and the --host options *and* no slot count is given for the host, we should
> default to using the number of slots from the slot list
> (e.g. in this case, default to --host motomachi:12 instead of (i guess)
> --host motomachi:1)
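
To make the proposed defaulting concrete: the idea is to derive the missing
slot count for the host from the slot list itself. The following standalone
sketch only illustrates that counting step (a hypothetical example, not the
actual mpirun/ORTE code):

    /* count_slots.c - count the cores described by a slot-list string such
     * as "0:0-5,1:0-5"; under the proposal, that count would become the
     * default slot count for a --host entry given without one. */
    #include <stdio.h>
    #include <string.h>

    /* note: strtok() modifies the buffer in place */
    static int count_slots(char *slot_list)
    {
        int total = 0;

        for (char *entry = strtok(slot_list, ","); entry != NULL;
             entry = strtok(NULL, ",")) {
            /* each entry is "socket:first-last" or "socket:core" */
            char *cores = strchr(entry, ':');
            cores = (cores != NULL) ? cores + 1 : entry;

            int first, last;
            if (sscanf(cores, "%d-%d", &first, &last) == 2) {
                total += last - first + 1;   /* a range of cores */
            } else {
                total += 1;                  /* a single core */
            }
        }
        return total;
    }

    int main(void)
    {
        char slot_list[] = "0:0-5,1:0-5";   /* the slot list used above */

        /* prints 12, i.e. --host motomachi would behave like
         * --host motomachi:12 under the proposed default */
        printf("%s -> %d slots\n", "0:0-5,1:0-5", count_slots(slot_list));
        return 0;
    }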
>
>
> /* fwiw, i made
>
> https://github.com/open-mpi/ompi/pull/2715
>
> but this is not the root cause */
>
>
> Cheers,
>
>
> Gilles
>
>
>
> -------- Forwarded Message --------
> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3
> on Linux
> Date: Wed, 11 Jan 2017 20:39:02 +0900
> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
> Reply-To: Open MPI Users <us...@lists.open-mpi.org>
> To: Open MPI Users <us...@lists.open-mpi.org>
>
>
> Siegmar,
>
> Your slot list is correct.
> An invalid slot list for your node would be 0:1-7,1:0-7
>
> /* and since the test requires only 5 tasks, it could even work with
> such an invalid list.
> My vm has a single socket with 4 cores, so a 0:0-4 slot list results in an
> unfriendly pmix error */
>
> Bottom line, your test is correct, and there is a bug in v2.0.x that I
> will start investigating tomorrow.
>
> Cheers,
>
> Gilles
>
> On Wednesday, January 11, 2017, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
>
>> Hi Gilles,
>>
>> thank you very much for your help. What does an incorrect slot list
>> mean? My machine has two 6-core processors, so I specified
>> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
>> allowed to specify more slots than available, to specify fewer
>> slots than available, or to specify more slots than needed for
>> the processes?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>>
>>> Siegmar,
>>>
>>> I was able to reproduce the issue on my vm
>>> (No need for a real heterogeneous cluster here)
>>>
>>> I will keep digging tomorrow.
>>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
>>> with a very unfriendly error message.
>>> Right now, the 4th spawned task crashes, so this is a different issue.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> r...@open-mpi.org wrote:
>>> I think there is some relevant discussion here:
>>> https://github.com/open-mpi/ompi/issues/1569
>>>
>>> It looks like Gilles had (at least at one point) a fix for master when
>>> building with --enable-heterogeneous, but I don’t know if that was committed.
>>>
>>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>>
>>>> HI Siegmar,
>>>>
>>>> You have some config parameters I wasn't trying that may have some
>>>> impact.
>>>> I'll give it a try with these parameters.
>>>>
>>>> This should be enough info for now,
>>>>
>>>> Thanks,
>>>>
>>>> Howard
>>>>
>>>>
>>>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>>
>>>>     Hi Howard,
>>>>
>>>>     I use the following commands to build and install the package.
>>>>     ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
>>>>     Linux machine.
>>>>
>>>>     mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>>     cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>>
>>>>     ../openmpi-2.0.2rc3/configure \
>>>>       --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>>>       --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>>>       --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>>>       --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>>>       JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>>>       LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>>>       CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>>       CPP="cpp" CXXCPP="cpp" \
>>>>       --enable-mpi-cxx \
>>>>       --enable-mpi-cxx-bindings \
>>>>       --enable-cxx-exceptions \
>>>>       --enable-mpi-java \
>>>>       --enable-heterogeneous \
>>>>       --enable-mpi-thread-multiple \
>>>>       --with-hwloc=internal \
>>>>       --without-verbs \
>>>>       --with-wrapper-cflags="-m64 -mt" \
>>>>       --with-wrapper-cxxflags="-m64" \
>>>>       --with-wrapper-fcflags="-m64" \
>>>>       --with-wrapper-ldflags="-mt" \
>>>>       --enable-debug \
>>>>       |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>>
>>>>     make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>>     rm -r /usr/local/openmpi-2.0.2_64_cc.old
>>>>     mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>>>>     make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>>     make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>>
>>>>
>>>>     I get a different error if I run the program with gdb.
>>>>
>>>>     loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>>>>     GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>>>>     Copyright (C) 2016 Free Software Foundation, Inc.
>>>>     License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>     This is free software: you are free to change and redistribute it.
>>>>     There is NO WARRANTY, to the extent permitted by law.  Type "show
>>>> copying"
>>>>     and "show warranty" for details.
>>>>     This GDB was configured as "x86_64-suse-linux".
>>>>     Type "show configuration" for configuration details.
>>>>     For bug reporting instructions, please see:
>>>>     <http://bugs.opensuse.org/>.
>>>>     Find the GDB manual and other documentation resources online at:
>>>>     <http://www.gnu.org/software/gdb/documentation/>.
>>>>     For help, type "help".
>>>>     Type "apropos word" to search for commands related to "word"...
>>>>     Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>>>>     (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>>>     Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1
>>>> --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>>>     Missing separate debuginfos, use: zypper install
>>>> glibc-debuginfo-2.24-2.3.x86_64
>>>>     [Thread debugging using libthread_db enabled]
>>>>     Using host libthread_db library "/lib64/libthread_db.so.1".
>>>>     [New Thread 0x7ffff3b97700 (LWP 13582)]
>>>>     [New Thread 0x7ffff18a4700 (LWP 13583)]
>>>>     [New Thread 0x7ffff10a3700 (LWP 13584)]
>>>>     [New Thread 0x7fffebbba700 (LWP 13585)]
>>>>     Detaching after fork from child process 13586.
>>>>
>>>>     Parent process 0 running on loki
>>>>       I create 4 slave processes
>>>>
>>>>     Detaching after fork from child process 13589.
>>>>     Detaching after fork from child process 13590.
>>>>     Detaching after fork from child process 13591.
>>>>     [loki:13586] OPAL ERROR: Timeout in file
>>>> ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at
>>>> line 193
>>>>     [loki:13586] *** An error occurred in MPI_Comm_spawn
>>>>     [loki:13586] *** reported by process [2873294849,0]
>>>>     [loki:13586] *** on communicator MPI_COMM_WORLD
>>>>     [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>>>>     [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>> communicator will now abort,
>>>>     [loki:13586] ***    and potentially your MPI job)
>>>>     [Thread 0x7fffebbba700 (LWP 13585) exited]
>>>>     [Thread 0x7ffff10a3700 (LWP 13584) exited]
>>>>     [Thread 0x7ffff18a4700 (LWP 13583) exited]
>>>>     [Thread 0x7ffff3b97700 (LWP 13582) exited]
>>>>     [Inferior 1 (process 13567) exited with code 016]
>>>>     Missing separate debuginfos, use: zypper install
>>>> libpciaccess0-debuginfo-0.13.2-5.1.x86_64
>>>> libudev1-debuginfo-210-116.3.3.x86_64
>>>>     (gdb) bt
>>>>     No stack.
>>>>     (gdb)
>>>>
>>>>     Do you need anything else?
>>>>
>>>>
>>>>     Kind regards
>>>>
>>>>     Siegmar
>>>>
>>>>     On 08.01.2017 at 17:02, Howard Pritchard wrote:
>>>>
>>>>         HI Siegmar,
>>>>
>>>>         Could you post the configury options you use when building the
>>>> 2.0.2rc3?
>>>>         Maybe that will help in trying to reproduce the segfault you
>>>> are observing.
>>>>
>>>>         Howard
>>>>
>>>>
>>>>         2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>>
>>>>             Hi,
>>>>
>>>>             I have installed openmpi-2.0.2rc3 on my "SUSE Linux
>>>> Enterprise
>>>>             Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0.
>>>> Unfortunately,
>>>>             I still get the same error that I reported for rc2.
>>>>
>>>>             I would be grateful if somebody could fix the problem before
>>>>             releasing the final version. Thank you very much in advance
>>>>             for any help.
>>>>
>>>>
>>>>             Kind regards
>>>>
>>>>             Siegmar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
