Ralph,

my replies are inline below


On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let’s do that one. Either way, this isn’t something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.

Got it!
Note that I am not yet convinced that is the issue here. We’ve had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn’t what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren’t being used, then we could use the “isolated” mode where the usock OOB isn’t even activated, thus eliminating the problem. This would be a much smaller “fix” and could potentially fit into 2.x

A bug has been identified and fixed; let's wait and see how things go.

How can I use the isolated mode?
Shall I simply
export OMPI_MCA_pmix=isolated
export OMPI_MCA_plm=isolated
?

Out of curiosity, does "isolated" mean we would not even need to fork the HNP?


FWIW: every organization I’ve worked with has an epilog script that blows away temp dirs. It isn’t the RM-based environment that is of concern - it’s the non-RM one where epilog scripts don’t exist that is the problem.
Well, I was looking at this the other way around:
if mpirun/orted creates the session directory with mkstemp(), then there is no longer any need for cleanup
(as long as you do not run out of disk space).
With direct run, however, there is always a small risk that a previous session directory is reused, hence the requirement for an epilog. Also, if the RM is configured to run only one job at a time per node, the epilog can be quite trivial, but if several jobs can run on a given node at the same time, the epilog becomes less trivial.
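For the sake of discussion, here is a rough sketch of the mpirun side of that idea, using mkdtemp() (the directory flavour of mkstemp()). This is illustrative only, not the actual Open MPI code, and the OMPI_SESSION_DIR variable name is made up:

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* mkdtemp() replaces the XXXXXX with a unique suffix and creates
     * the directory with mode 0700, so two concurrent jobs cannot end
     * up sharing it and no stale directory is ever reused */
    char tmpl[] = "/tmp/ompi-session-XXXXXX";
    if (NULL == mkdtemp(tmpl)) {
        perror("mkdtemp");
        return 1;
    }

    /* hand the path to the MPI tasks, e.g. via their environment
     * before fork/exec, so they can all rendezvous under it
     * (OMPI_SESSION_DIR is a made-up name for illustration) */
    setenv("OMPI_SESSION_DIR", tmpl, 1);
    printf("session directory: %s\n", tmpl);
    return 0;
}

The point is simply that uniqueness comes from the kernel instead of a 16-bit hash, so neither a collision nor a leftover directory is possible.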


Cheers,

Gilles

On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph,


On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
Many things are possible, given infinite time :-)

I could not agree more :-D
The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can’t rendezvous.
Thanks for the explanation;
let me rephrase it:
"an MPI task must be able to rebuild the path to the session directory based on the information it has when launched. If mpirun is used, we have several options for passing this information to the MPI tasks. In the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal yet), so we have to use what is available."

My concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family,
which is stored on 16 bits.

In the case of mpirun, the jobfam is a 16-bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux). If several mpirun are invoked on the same host at the same time, there is a risk that two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits). There is also a risk that the session directory already exists from a previous job, with some or all of the files and Unix sockets of that previous job still present, leading to undefined behavior
(an early crash if we are lucky, odd things otherwise).

In the case of direct run, I guess the jobfam is a 16-bit hash of the jobid passed by the RM, and once again there is a risk of a conflict and/or of reusing a previous session directory.
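To make the collision risk concrete, here is a toy illustration (emphatically not the hash orte_plm_base_set_hnp_name() actually uses): once hostname and pid are folded down to 16 bits, only 65536 jobfam values exist, and two pids on the same host can map to the same one.

#include <stdint.h>
#include <stdio.h>

/* toy 16-bit "jobfam": hash the hostname, mix in the pid, fold to 16 bits
 * (illustrative only, not the Open MPI implementation) */
static uint16_t toy_jobfam(const char *host, uint32_t pid)
{
    uint32_t h = 5381;
    for (const char *p = host; *p != '\0'; p++) {
        h = h * 33 + (unsigned char)*p;   /* djb2-style string hash */
    }
    h ^= pid;
    return (uint16_t)(h ^ (h >> 16));     /* fold 32 bits down to 16 */
}

int main(void)
{
    /* flipping bit 0 and bit 16 of the pid cancels out in the fold,
     * so these two pids are assigned the very same toy jobfam */
    printf("%u\n", (unsigned)toy_jobfam("lorien", 142766));
    printf("%u\n", (unsigned)toy_jobfam("lorien", 142766 ^ 0x00010001));
    return 0;
}

And even without such a contrived pair, the birthday bound says that once roughly 300 jobfam values have been drawn at random out of 65536, a collision is already more likely than not.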

To me, the issue here is that we use the Open MPI jobfam to build the session directory path.
Instead, what if we
1) with mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or have them retrieve it from orted/mpirun right after the communication has been established;
2) with direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM?

In case 1), there is no longer any risk of a hash conflict or of reusing a previous session directory.
In case 2), there is no longer any risk of a hash conflict, but there is still a risk of reusing the session directory of a previous (e.g. terminated) job. That being said, once we document how the session directory is built from the jobid, sysadmins will be able to write an RM epilog that removes it.
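For 2), I picture something along these lines (illustrative only: SLURM_JOB_ID and the path layout are just examples, not the actual Open MPI naming scheme):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* the full jobid string exported by the RM; SLURM_JOB_ID is just
     * an example, other RMs export an equivalent variable */
    const char *rm_jobid = getenv("SLURM_JOB_ID");
    char path[4096];

    if (NULL == rm_jobid) {
        fprintf(stderr, "no RM jobid found in the environment\n");
        return 1;
    }

    /* no 16-bit hash involved: the path is derived directly from the
     * RM jobid, so it is unique per running job and trivial to
     * document for an epilog script (which would simply rm -rf it) */
    snprintf(path, sizeof(path), "/tmp/openmpi-sessions-%d/%s",
             (int)getuid(), rm_jobid);
    printf("session directory would be: %s\n", path);
    return 0;
}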

Does that make sense?

However, that doesn’t seem to be the root problem here. I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you’ll see that both jobid’s have “zero” for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.

So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.

I will investigate that.
As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn’t fix the problem, but might generate a different error at a more obvious place.

Do you mean https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ?
That has not been backported to v2.x, and the issue was reported on v2.x.


Cheers,

Gilles

On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Ralph,

Is there any reason to use a session directory based on the jobid (or job family)? I mean, could we use mkstemp to generate a unique directory and then propagate the path via orted comm or the environment?

Cheers,

Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

    This has nothing to do with PMIx, Josh - the error is coming
    out of the usock OOB component.


    On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

    Eric,

    We are looking into the PMIx code path that sets up the jobid.
    The session directories are created based on the jobid. It
    might be the case that the jobids (generated with rand) happen
    to be the same for different jobs resulting in multiple jobs
    sharing the same session directory, but we need to check. We
    will update.

    Josh

    On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

        Lucky!

        Since each runs have a specific TMP, I still have it on disc.

        for the faulty run, the TMP variable was:

        TMP=/tmp/tmp.wOv5dkNaSI

        and into $TMP I have:

        openmpi-sessions-40031@lorien_0

        and into this subdirectory I have a bunch of empty dirs:

        cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
        ls -la |wc -l
        1841

        cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
        ls -la |more
        total 68
        drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
        drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
        drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
        drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
        drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
        drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
        drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
        drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
        ...

        If I do:

        lsof |grep "openmpi-sessions-40031"
        lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
        /run/user/1000/gvfs
              Output information may be incomplete.
        lsof: WARNING: can't stat() tracefs file system
        /sys/kernel/debug/tracing
              Output information may be incomplete.

        nothing...

        What else may I check?

        Eric


        On 14/09/16 08:47 AM, Joshua Ladd wrote:

            Hi, Eric

            I **think** this might be related to the following:

            https://github.com/pmix/master/pull/145

            I'm wondering if you can look into the /tmp directory
            and see if you
            have a bunch of stale usock files.

            Best,

            Josh


            On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

                Eric,


                can you please provide more information on how
            your tests are launched ?

                do you

                mpirun -np 1 ./a.out

                or do you simply

                ./a.out


                do you use a batch manager ? if yes, which one ?

                do you run one test per job ? or multiple tests
            per job ?

                how are these tests launched ?


                does the test that crashes use MPI_Comm_spawn ?

                I am surprised by the process name [[9325,5754],0],
                which suggests MPI_Comm_spawn was called 5753 times (!)


                can you also run

                hostname

                on the 'lorien' host ?

                if you configured Open MPI with --enable-debug, can you

                export OMPI_MCA_plm_base_verbose=5

                then run one test and post the logs ?


                from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
                should produce job family 5576 (but you get 9325)

                the discrepancy could be explained by the use of a batch
                manager and/or a full hostname I am unaware of.


                orte_plm_base_set_hnp_name() generates a 16-bit job family
                from the (32-bit hash of the) hostname and the mpirun
                (32-bit ?) pid.

                so strictly speaking, it is possible two jobs launched on
                the same node are assigned the same 16-bit job family.


                the easiest way to detect this could be to

                - edit orte/mca/plm/base/plm_base_jobid.c and replace

                OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

                with

                OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

                configure Open MPI with --enable-debug and rebuild

                and then

                export OMPI_MCA_plm_base_verbose=4

                and run your tests.


                when the problem occurs, you will be able to check which
                pids produced the faulty jobfam, and that could hint at
                a conflict.


                Cheers,


                Gilles



                On 9/14/2016 12:35 AM, Eric Chamberland wrote:

                    Hi,

                    It is the third time this has happened in the last
                    10 days.

                    While running nightly tests (~2200), we have one or
                    two tests that fail at the very beginning with this
                    strange error:

            [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

                    But I can't reproduce the problem right now... i.e.,
                    if I launch this test alone "by hand", it is
                    successful... the same test was successful yesterday...

                    Is there some kind of "race condition" that can
                    happen in the creation of "tmp" files if many tests
                    run together on the same node? (we are
                    oversubscribing even sequential runs...)

                    Here are the build logs:

            
            http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
            http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


                    Thanks,

                    Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
