Ralph,

my replies are inline below


On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let’s do that one. Either way, this isn’t something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.

Got it!
Note that I am not yet convinced that is the issue here. We’ve had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn’t what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren’t being used, then we could use the “isolated” mode where the usock OOB isn’t even activated, thus eliminating the problem. This would be a much smaller “fix” and could potentially fit into 2.x

A bug has been identified and fixed; let's wait and see how things go.

How can I use the isolated mode?
Shall I simply
export OMPI_MCA_pmix=isolated
export OMPI_MCA_plm=isolated
?

Out of curiosity, does "isolated" mean we would not even need to fork the HNP?


FWIW: every organization I’ve worked with has an epilog script that blows away temp dirs. It isn’t the RM-based environment that is of concern - it’s the non-RM one where epilog scripts don’t exist that is the problem.
Well, I was looking at this the other way around:
if mpirun/orted creates the session directory with mkstemp(), then there is no longer any need for cleanup
(as long as you do not run out of disk space).
With direct run, however, there is always a small risk that a previous session directory is reused, hence the requirement for an epilog. Also, if the RM is configured to run only one job at a time per node, the epilog can be quite trivial, but if several jobs can run on a given node at the same time, the epilog becomes less trivial.
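For the sake of discussion, here is a rough sketch of the mpirun side of that idea, using mkdtemp() (the directory flavour of mkstemp()). This is illustrative only, not the actual Open MPI code, and the OMPI_SESSION_DIR variable name is made up:

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* mkdtemp() replaces the XXXXXX with a unique suffix and creates
     * the directory with mode 0700, so two concurrent jobs cannot end
     * up sharing it and no stale directory is ever reused */
    char tmpl[] = "/tmp/ompi-session-XXXXXX";
    if (NULL == mkdtemp(tmpl)) {
        perror("mkdtemp");
        return 1;
    }

    /* hand the path to the MPI tasks, e.g. via their environment
     * before fork/exec, so they can all rendezvous under it
     * (OMPI_SESSION_DIR is a made-up name for illustration) */
    setenv("OMPI_SESSION_DIR", tmpl, 1);
    printf("session directory: %s\n", tmpl);
    return 0;
}

The point is simply that uniqueness comes from the kernel instead of a 16-bit hash, so neither a collision nor a leftover directory is possible.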


Cheers,

Gilles

On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph,


On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
Many things are possible, given infinite time :-)

I could not agree more :-D
The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can’t rendezvous.
Thanks for the explanation;
let me rephrase it:
"an MPI task must be able to rebuild the path to the session directory based on the information it has when launched. If mpirun is used, we have several options for passing this information to the MPI tasks. In the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal yet), so we have to use what is available."

My concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family,
which is stored on 16 bits.

In the case of mpirun, the jobfam is a 16-bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux). If several mpirun are invoked on the same host at the same time, there is a risk that two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits). There is also a risk that the session directory already exists from a previous job, with some or all of the files and Unix sockets of that previous job still present, leading to undefined behavior
(an early crash if we are lucky, odd things otherwise).

In the case of direct run, I guess the jobfam is a 16-bit hash of the jobid passed by the RM, and once again there is a risk of a conflict and/or of reusing a previous session directory.
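To make the collision risk concrete, here is a toy illustration (emphatically not the hash orte_plm_base_set_hnp_name() actually uses): once hostname and pid are folded down to 16 bits, only 65536 jobfam values exist, and two pids on the same host can map to the same one.

#include <stdint.h>
#include <stdio.h>

/* toy 16-bit "jobfam": hash the hostname, mix in the pid, fold to 16 bits
 * (illustrative only, not the Open MPI implementation) */
static uint16_t toy_jobfam(const char *host, uint32_t pid)
{
    uint32_t h = 5381;
    for (const char *p = host; *p != '\0'; p++) {
        h = h * 33 + (unsigned char)*p;   /* djb2-style string hash */
    }
    h ^= pid;
    return (uint16_t)(h ^ (h >> 16));     /* fold 32 bits down to 16 */
}

int main(void)
{
    /* flipping bit 0 and bit 16 of the pid cancels out in the fold,
     * so these two pids are assigned the very same toy jobfam */
    printf("%u\n", (unsigned)toy_jobfam("lorien", 142766));
    printf("%u\n", (unsigned)toy_jobfam("lorien", 142766 ^ 0x00010001));
    return 0;
}

And even without such a contrived pair, the birthday bound says that once roughly 300 jobfam values have been drawn at random out of 65536, a collision is already more likely than not.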

To me, the issue here is that we use the Open MPI jobfam to build the session directory path.
Instead, what if we
1) with mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or have them retrieve it from orted/mpirun right after the communication has been established;
2) with direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM?

In case 1), there is no longer any risk of a hash conflict or of reusing a previous session directory.
In case 2), there is no longer any risk of a hash conflict, but there is still a risk of reusing the session directory of a previous (e.g. terminated) job. That being said, once we document how the session directory is built from the jobid, sysadmins will be able to write an RM epilog that removes it.
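For 2), I picture something along these lines (illustrative only: SLURM_JOB_ID and the path layout are just examples, not the actual Open MPI naming scheme):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* the full jobid string exported by the RM; SLURM_JOB_ID is just
     * an example, other RMs export an equivalent variable */
    const char *rm_jobid = getenv("SLURM_JOB_ID");
    char path[4096];

    if (NULL == rm_jobid) {
        fprintf(stderr, "no RM jobid found in the environment\n");
        return 1;
    }

    /* no 16-bit hash involved: the path is derived directly from the
     * RM jobid, so it is unique per running job and trivial to
     * document for an epilog script (which would simply rm -rf it) */
    snprintf(path, sizeof(path), "/tmp/openmpi-sessions-%d/%s",
             (int)getuid(), rm_jobid);
    printf("session directory would be: %s\n", path);
    return 0;
}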

Does that make sense?

However, that doesn’t seem to be the root problem here. I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you’ll see that both jobid’s have “zero” for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.

So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.

I will investigate that.
As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn’t fix the problem, but might generate a different error at a more obvious place.

Do you mean https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ?
That has not been backported to v2.x, and the issue was reported on v2.x.


Cheers,

Gilles

On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Ralph,

Is there any reason to use a session directory based on the jobid (or job family)? I mean, could we use mkstemp to generate a unique directory and then propagate the path via orted comm or the environment?

Cheers,

Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

    This has nothing to do with PMIx, Josh - the error is coming
    out of the usock OOB component.


    On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

    Eric,

    We are looking into the PMIx code path that sets up the jobid.
    The session directories are created based on the jobid. It
    might be the case that the jobids (generated with rand) happen
    to be the same for different jobs resulting in multiple jobs
    sharing the same session directory, but we need to check. We
    will update.

    Josh

    On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

        Lucky!

        Since each runs have a specific TMP, I still have it on disc.

        for the faulty run, the TMP variable was:

        TMP=/tmp/tmp.wOv5dkNaSI

        and into $TMP I have:

        openmpi-sessions-40031@lorien_0

        and into this subdirectory I have a bunch of empty dirs:

        cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
        ls -la |wc -l
        1841

        cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
        ls -la |more
        total 68
        drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
        drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
        drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
        drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
        drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
        drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
        drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
        drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
        ...

        If I do:

        lsof |grep "openmpi-sessions-40031"
        lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
        /run/user/1000/gvfs
              Output information may be incomplete.
        lsof: WARNING: can't stat() tracefs file system
        /sys/kernel/debug/tracing
              Output information may be incomplete.

        nothing...

        What else may I check?

        Eric


        On 14/09/16 08:47 AM, Joshua Ladd wrote:

            Hi, Eric

            I **think** this might be related to the following:

            https://github.com/pmix/master/pull/145

            I'm wondering if you can look into the /tmp directory
            and see if you
            have a bunch of stale usock files.

            Best,

            Josh


            On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

                Eric,


                can you please provide more information on how
            your tests are launched ?

                do you

                mpirun -np 1 ./a.out

                or do you simply

                ./a.out


                do you use a batch manager ? if yes, which one ?

                do you run one test per job ? or multiple tests
            per job ?

                how are these tests launched ?


                does the test that crashes use MPI_Comm_spawn ?

                I am surprised by the process name [[9325,5754],0],
                which suggests MPI_Comm_spawn was called 5753 times (!)


                can you also run

                hostname

                on the 'lorien' host ?

                if you configured Open MPI with --enable-debug, can you

                export OMPI_MCA_plm_base_verbose=5

                then run one test and post the logs ?


                from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
                should produce job family 5576 (but you get 9325)

                the discrepancy could be explained by the use of a batch
                manager and/or a full hostname I am unaware of.


                orte_plm_base_set_hnp_name() generates a 16-bit job family
                from the (32-bit hash of the) hostname and the mpirun
                (32-bit ?) pid.

                so strictly speaking, it is possible two jobs launched on
                the same node are assigned the same 16-bit job family.


                the easiest way to detect this could be to

                - edit orte/mca/plm/base/plm_base_jobid.c and replace

                OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

                with

                OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

                configure Open MPI with --enable-debug and rebuild

                and then

                export OMPI_MCA_plm_base_verbose=4

                and run your tests.


                when the problem occurs, you will be able to check which
                pids produced the faulty jobfam, and that could hint at
                a conflict.


                Cheers,


                Gilles



                On 9/14/2016 12:35 AM, Eric Chamberland wrote:

                    Hi,

                    It is the third time this has happened in the last
                    10 days.

                    While running nightly tests (~2200), we have one or
                    two tests that fail at the very beginning with this
                    strange error:

            [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

                    But I can't reproduce the problem right now... i.e.,
                    if I launch this test alone "by hand", it is
                    successful... the same test was successful yesterday...

                    Is there some kind of "race condition" that can
                    happen in the creation of "tmp" files if many tests
                    run together on the same node? (we are
                    oversubscribing even sequential runs...)

                    Here are the build logs:

            
            http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
            http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


                    Thanks,

                    Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
