Hi Howard and Gilles,

thank you very much for your help. All commands that work for
Gilles also work on my machine as expected, and the commands that
don't work on his machine don't work on mine either. The first one
that works with both --slot-list and --host is the following
command, so it seems that the value depends on the number of
processes in the remote group.

loki spawn 122 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:3 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 3

Slave process 0 of 3 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 3 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 3 running on loki
spawn_slave 2: argv[0]: spawn_slave
loki spawn 123
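
For reference, here is a minimal sketch of a spawn master along the lines
of my test (a sketch only; my actual spawn_master differs in details, and
the slave binary name "spawn_slave" is simply hard-coded here):

/* spawn_master sketch: spawn 4 slaves and report the group sizes.
   Compile with: mpicc spawn_master_sketch.c -o spawn_master_sketch  */
#include <stdio.h>
#include "mpi.h"

#define NUM_SLAVES 4

int main (int argc, char *argv[])
{
  MPI_Comm COMM_CHILD_PROCESSES;     /* intercommunicator to the slaves */
  int ntasks_world, ntasks_local, ntasks_remote, mytid, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Parent process %d running on %s\n", mytid, processor_name);
  printf ("  I create %d slave processes\n", NUM_SLAVES);

  /* where the slaves land depends on --host / --slot-list, which is
     exactly what this thread is about                               */
  MPI_Comm_spawn ("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES, MPI_INFO_NULL,
                  0, MPI_COMM_WORLD, &COMM_CHILD_PROCESSES,
                  MPI_ERRCODES_IGNORE);

  MPI_Comm_size (MPI_COMM_WORLD, &ntasks_world);
  MPI_Comm_size (COMM_CHILD_PROCESSES, &ntasks_local);
  MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &ntasks_remote);
  printf ("Parent process 0: tasks in MPI_COMM_WORLD:                    %d\n"
          "                  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
          "                  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
          ntasks_world, ntasks_local, ntasks_remote);

  MPI_Comm_disconnect (&COMM_CHILD_PROCESSES);
  MPI_Finalize ();
  return 0;
}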


Here is the output from the other commands.

loki spawn 112 mpirun -np 1 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 3 of 4 running on loki
Slave process 0 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
loki spawn 113 mpirun -np 1 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 114 mpirun -np 1 --host loki --oversubscribe spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 115 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:12 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 3 of 4 running on loki
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

spawn_slave 2: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
loki spawn 116 mpirun -np 1 --host loki:12 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 117
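
In case it is useful for further debugging, I can also run a variant that
lets MPI_Comm_spawn return its error code instead of aborting with
MPI_ERRORS_ARE_FATAL (again only a sketch, with the same assumed program
names as above):

/* capture the spawn error code instead of aborting the whole job */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  MPI_Comm COMM_CHILD_PROCESSES;
  char msg[MPI_MAX_ERROR_STRING];
  int errcodes[4], rc, msglen;

  MPI_Init (&argc, &argv);
  /* return error codes to the caller instead of terminating the job */
  MPI_Comm_set_errhandler (MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  rc = MPI_Comm_spawn ("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &COMM_CHILD_PROCESSES, errcodes);
  if (rc != MPI_SUCCESS) {
    MPI_Error_string (rc, msg, &msglen);
    fprintf (stderr, "MPI_Comm_spawn failed: %s\n", msg);
  } else {
    MPI_Comm_disconnect (&COMM_CHILD_PROCESSES);
  }
  MPI_Finalize ();
  return 0;
}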


Kind regards

Siegmar

On 12.01.2017 at 22:25, r...@open-mpi.org wrote:
Fix is pending here: https://github.com/open-mpi/ompi/pull/2730

On Jan 12, 2017, at 8:57 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

Siegmar,

Could you confirm that if you use one of the mpirun arg lists that works for
Gilles, your test case passes. Something simple like

mpirun -np 1 ./spawn_master

?

Howard




2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet <gil...@rist.or.jp>:

    Ralph,


    so it seems the root cause is a kind of incompatibility between the --host and the --slot-list options


    on a single node with two six-core sockets,

    this works :

    mpirun -np 1 ./spawn_master
    mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
    mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
    mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master


    this does not work

    mpirun -np 1 --host motomachi ./spawn_master # not enough slots available, aborts with a user-friendly error message
    mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master # various errors, sm_segment_attach() fails, a task crashes
    and this ends up with the following error message:

    At least one pair of MPI processes are unable to reach each other for
    MPI communications.  This means that no Open MPI device has indicated
    that it can be used to communicate between these processes.  This is
    an error; Open MPI requires that all MPI processes be able to reach
    each other.  This error can sometimes be the result of forgetting to
    specify the "self" BTL.

      Process 1 ([[15519,2],0]) is on host: motomachi
      Process 2 ([[15519,2],1]) is on host: unknown!
      BTLs attempted: self tcp

    mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master # same error as above
    mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master # same error as above


    for the record, the following command surprisingly works

    mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master



    bottom line, my guess is that when the user specifies the --slot-list and the --host options
    *and* there are no default slot numbers for the hosts, we should default to using the
    number of slots from the slot list
    (e.g. in this case, default to --host motomachi:12 instead of, i guess, --host motomachi:1).
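
    To illustrate the proposed defaulting, a toy sketch (this is not actual
    Open MPI code; the helper below is made up just to show the arithmetic):

    /* count the cores named by a slot list such as "0:0-5,1:0-5" and use
       that count when a --host entry carries no ":N" suffix              */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int slot_list_count (const char *slot_list)
    {
        int count = 0, sock, lo, hi;
        char *copy = strdup (slot_list), *save = NULL, *tok;

        for (tok = strtok_r (copy, ",", &save); tok != NULL;
             tok = strtok_r (NULL, ",", &save)) {
            if (sscanf (tok, "%d:%d-%d", &sock, &lo, &hi) == 3) {
                count += hi - lo + 1;     /* e.g. 0:0-5 -> 6 cores */
            } else if (sscanf (tok, "%d:%d", &sock, &lo) == 2) {
                count += 1;               /* a single core */
            }
        }
        free (copy);
        return count;
    }

    int main (void)
    {
        /* no explicit slot count on the host, so fall back to the size
           of the slot list instead of (presumably) 1                   */
        int slots = slot_list_count ("0:0-5,1:0-5");
        printf ("--host motomachi defaults to --host motomachi:%d\n", slots);
        return 0;
    }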


    /* fwiw, i made

    https://github.com/open-mpi/ompi/pull/2715

    but this is not the root cause */


    Cheers,


    Gilles



    -------- Forwarded Message --------
    Subject:    Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
    Date:       Wed, 11 Jan 2017 20:39:02 +0900
    From:       Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
    Reply-To:   Open MPI Users <us...@lists.open-mpi.org>
    To:         Open MPI Users <us...@lists.open-mpi.org>



    Siegmar,

    Your slot list is correct.
    An invalid slot list for your node would be 0:1-7,1:0-7.

    /* and since the test requires only 5 tasks, that could even work with such an invalid list.
    My VM is single-socket with 4 cores, so a 0:0-4 slot list results in an unfriendly pmix error */
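
    To make "invalid" concrete: each slot list entry is socket:core-range, so a
    quick check against a 2 x 6-core topology (illustrative only, with the
    counts hard-coded) could look like this:

    #include <stdio.h>

    #define NUM_SOCKETS      2
    #define CORES_PER_SOCKET 6            /* cores 0-5 on each socket */

    static int range_valid (int sock, int lo, int hi)
    {
        return sock >= 0 && sock < NUM_SOCKETS &&
               lo >= 0 && hi < CORES_PER_SOCKET && lo <= hi;
    }

    int main (void)
    {
        printf ("0:0-5 valid? %d\n", range_valid (0, 0, 5));  /* 1 */
        printf ("0:1-7 valid? %d\n", range_valid (0, 1, 7));  /* 0: cores 6,7 do not exist */
        return 0;
    }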

    Bottom line, your test is correct, and there is a bug in v2.0.x that I will investigate from tomorrow.

    Cheers,

    Gilles

    On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

        Hi Gilles,

        thank you very much for your help. What does an incorrect slot list
        mean? My machine has two 6-core processors, so I specified
        "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
        allowed to specify more slots than available, to specify fewer
        slots than available, or to specify more slots than needed for
        the processes?


        Kind regards

        Siegmar

        On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:

            Siegmar,

            I was able to reproduce the issue on my vm
            (No need for a real heterogeneous cluster here)

            I will keep digging tomorrow.
            Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a very unfriendly error message.
            Right now, the 4th spawned task crashes, so this is a different issue.

            Cheers,

            Gilles

            r...@open-mpi.org wrote:
            I think there is some relevant discussion here: https://github.com/open-mpi/ompi/issues/1569

            It looks like Gilles had (at least at one point) a fix for master when --enable-heterogeneous is used, but I don’t know if that was committed.

                On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

                Hi Siegmar,

                You have some config parameters I wasn't trying that may have some impact.
                I'll give it a try with these parameters.

                This should be enough info for now,

                Thanks,

                Howard


                2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

                    Hi Howard,

                    I use the following commands to build and install the package.
                    ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
                    Linux machine.

                    mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
                    cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

                    ../openmpi-2.0.2rc3/configure \
                      --prefix=/usr/local/openmpi-2.0.2_64_cc \
                      --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
                      --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
                      --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
                      JAVA_HOME=/usr/local/jdk1.8.0_66 \
                      LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" 
FC="f95" \
                      CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
                      CPP="cpp" CXXCPP="cpp" \
                      --enable-mpi-cxx \
                      --enable-mpi-cxx-bindings \
                      --enable-cxx-exceptions \
                      --enable-mpi-java \
                      --enable-heterogeneous \
                      --enable-mpi-thread-multiple \
                      --with-hwloc=internal \
                      --without-verbs \
                      --with-wrapper-cflags="-m64 -mt" \
                      --with-wrapper-cxxflags="-m64" \
                      --with-wrapper-fcflags="-m64" \
                      --with-wrapper-ldflags="-mt" \
                      --enable-debug \
                      |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

                    make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
                    rm -r /usr/local/openmpi-2.0.2_64_cc.old
                     mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
                     make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
                     make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


                    I get a different error if I run the program with gdb.

                    loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
                    GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
                    Copyright (C) 2016 Free Software Foundation, Inc.
                    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
                    This is free software: you are free to change and redistribute it.
                    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
                    and "show warranty" for details.
                    This GDB was configured as "x86_64-suse-linux".
                    Type "show configuration" for configuration details.
                    For bug reporting instructions, please see:
                    <http://bugs.opensuse.org/>.
                    Find the GDB manual and other documentation resources online at:
                    <http://www.gnu.org/software/gdb/documentation/>.
                    For help, type "help".
                    Type "apropos word" to search for commands related to 
"word"...
                    Reading symbols from 
/usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
                    (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
                    Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
                    Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
                    [Thread debugging using libthread_db enabled]
                    Using host libthread_db library "/lib64/libthread_db.so.1".
                    [New Thread 0x7ffff3b97700 (LWP 13582)]
                    [New Thread 0x7ffff18a4700 (LWP 13583)]
                    [New Thread 0x7ffff10a3700 (LWP 13584)]
                    [New Thread 0x7fffebbba700 (LWP 13585)]
                    Detaching after fork from child process 13586.

                    Parent process 0 running on loki
                      I create 4 slave processes

                    Detaching after fork from child process 13589.
                    Detaching after fork from child process 13590.
                    Detaching after fork from child process 13591.
                    [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
                    [loki:13586] *** An error occurred in MPI_Comm_spawn
                    [loki:13586] *** reported by process [2873294849,0]
                    [loki:13586] *** on communicator MPI_COMM_WORLD
                    [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
                    [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                    [loki:13586] ***    and potentially your MPI job)
                    [Thread 0x7fffebbba700 (LWP 13585) exited]
                    [Thread 0x7ffff10a3700 (LWP 13584) exited]
                    [Thread 0x7ffff18a4700 (LWP 13583) exited]
                    [Thread 0x7ffff3b97700 (LWP 13582) exited]
                    [Inferior 1 (process 13567) exited with code 016]
                    Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
                    (gdb) bt
                    No stack.
                    (gdb)

                    Do you need anything else?


                    Kind regards

                    Siegmar

                    On 08.01.2017 at 17:02, Howard Pritchard wrote:

                        HI Siegmar,

                        Could you post the configury options you use when building the 2.0.2rc3?
                        Maybe that will help in trying to reproduce the segfault you are observing.

                        Howard


                        2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

                            Hi,

                            I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
                            Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
                            I still get the same error that I reported for rc2.

                            I would be grateful if somebody could fix the problem before
                            releasing the final version. Thank you very much for any help
                            in advance.


                            Kind regards

                            Siegmar

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
