On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,

do you mean you have a unique $TMP per a.out?

No

or a unique $TMP per "batch" of runs?

Yes.

I was happy because each nightly batch has its own TMP, so I can check afterwards for problems related to a specific night without interference from another nightly batch of tests... if a bug ever happens... ;)


in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please post the output of the failed a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict
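
For reference, a single pass over the whole nightly batch could look like the
following, assuming the logs end up under a hypothetical directory such as
/path/to/nightly/logs:

egrep 'jobfam|stop' /path/to/nightly/logs/*.log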


OK, I will launch it manually later today, but it will run automatically tonight (with export OMPI_MCA_plm_base_verbose=5).

Thanks!

Eric


Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:

    Lucky!

    Since each run has a specific TMP, I still have it on disk.

    For the faulty run, the TMP variable was:

    TMP=/tmp/tmp.wOv5dkNaSI

    and inside $TMP I have:

    openmpi-sessions-40031@lorien_0

    and inside this subdirectory I have a bunch of empty dirs:

    cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
    ls -la |wc -l
    1841

    cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
    ls -la |more
    total 68
    drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
    drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
    drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
    drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
    drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
    drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
    ...

    If I do:

    lsof |grep "openmpi-sessions-40031"
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
    /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() tracefs file system
    /sys/kernel/debug/tracing
          Output information may be incomplete.

    nothing...

    What else may I check?

    Eric


    On 14/09/16 08:47 AM, Joshua Ladd wrote:

        Hi, Eric

        I **think** this might be related to the following:

        https://github.com/pmix/master/pull/145

        I'm wondering if you can look into the /tmp directory and see if you
        have a bunch of stale usock files.
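
        For instance, leftover Unix-domain sockets under the session directory
        of the faulty run (using the TMP path from the faulty run shown earlier
        in this thread) should show up with something like:

        find /tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0 -type s -ls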

        Best,

        Josh


        On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
        <gil...@rist.or.jp> wrote:

            Eric,


            can you please provide more information on how your tests are launched?

            do you

            mpirun -np 1 ./a.out

            or do you simply

            ./a.out


            do you use a batch manager? if yes, which one?

            do you run one test per job, or multiple tests per job?

            how are these tests launched?


            does the test that crashes use MPI_Comm_spawn?

            I am surprised by the process name [[9325,5754],0], which suggests
            MPI_Comm_spawn was called 5753 times (!)


            can you also run

            hostname

            on the 'lorien' host?

            if you configured Open MPI with --enable-debug, can you

            export OMPI_MCA_plm_base_verbose=5

            then run one test and post the logs?
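
            for instance (the log file name below is arbitrary):

            export OMPI_MCA_plm_base_verbose=5
            mpirun -np 1 ./a.out 2>&1 | tee plm_verbose.log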


            from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
            produce job family 5576 (but you get 9325).

            the discrepancy could be explained by the use of a batch manager
            and/or a full hostname I am unaware of.


            orte_plm_base_set_hnp_name() generates a 16-bit job family from the
            (32-bit hash of the) hostname and the mpirun (32-bit?) pid.

            so strictly speaking, it is possible that two jobs launched on the
            same node are assigned the same 16-bit job family.
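
            just to illustrate the collision risk (this is not the actual hash
            used by orte_plm_base_set_hnp_name(), only a stand-in), any scheme
            that folds a 32-bit hostname hash and a pid down to 16 bits, e.g.

            h=$(hostname | cksum | cut -d' ' -f1)   # some 32-bit hash of the hostname
            echo $(( (h ^ $$) & 0xFFFF ))           # combined with a pid, truncated to 16 bits

            has only 65536 possible values, so two independent mpirun invocations
            on the same node can occasionally collide.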


            the easiest way to detect this could be to

            - edit orte/mca/plm/base/plm_base_jobid.c

            and replace

                OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

            with

                OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final jobfam %lu",
                                     (unsigned long)jobfam));

            configure Open MPI with --enable-debug and rebuild

            and then

            export OMPI_MCA_plm_base_verbose=4

            and run your tests.


            when the problem occurs, you will be able to check which pids
            produced the faulty jobfam, and that could hint at a conflict.
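
            for instance, assuming each test's verbose output was captured in a
            hypothetical per-test plm_verbose.log, the pids that produced the
            jobfam from the error above can be listed with

            grep 'set_hnp_name: final jobfam 9325' */plm_verbose.log

            and two different pids matching the same jobfam would confirm a
            collision.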


            Cheers,


            Gilles



            On 9/14/2016 12:35 AM, Eric Chamberland wrote:

                Hi,

                It is the third time this has happened in the last 10 days.

                While running nightly tests (~2200), we have one or two tests
                that fail at the very beginning with this strange error:

                [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

                But I can't reproduce the problem right now... i.e., if I launch
                this test alone "by hand", it is successful... the same test was
                successful yesterday...

                Is there some kind of "race condition" that can happen on the
                creation of "tmp" files if many tests run together on the same
                node? (we are oversubscribing even sequential runs...)

                Here are the build logs:


        
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log


        
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


                Thanks,

                Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
