Nah, something isn't right here. The singleton doesn't go through that code path, or at least it isn't supposed to. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons work over there.
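For readers following along, here is a self-contained sketch of the branch in question (it is quoted verbatim in Gilles' message below). This is an editor's model, not the actual Open MPI sources: demo_hash_str() is a stand-in for OPAL_HASH_STR, DEMO_MCA_PREFIX stands in for OPAL_MCA_PREFIX, the nspace string is illustrative, and Gilles' proposed putenv() is shown at the call site only to illustrate its effect; the actual proposal is to set it in ess/singleton before PMIx is initialized.

    /* Editor's sketch only, not Open MPI source: a self-contained model of the
     * jobid-selection branch in pmix1_client_init().  demo_hash_str() is a
     * stand-in for OPAL_HASH_STR, DEMO_MCA_PREFIX stands in for
     * OPAL_MCA_PREFIX, and the nspace string below is illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define DEMO_MCA_PREFIX "OMPI_MCA_"

    /* stand-in 32-bit string hash (djb2-style); the real macro may differ */
    static uint32_t demo_hash_str(const char *s)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (unsigned char)*s++;
        }
        return h;
    }

    static uint32_t demo_pick_jobid(const char *nspace)
    {
        uint32_t jobid;

        if (NULL != getenv(DEMO_MCA_PREFIX "orte_launch")) {
            /* launched by the OMPI RTE: the nspace carries the jobid in
             * string form, so convert it losslessly
             * (opal_convert_string_to_jobid() in the real code) */
            jobid = (uint32_t)strtoul(nspace, NULL, 10);
        } else {
            /* launched by someone else: fall back to hashing the nspace,
             * which is where two unrelated jobs can end up with the same id */
            jobid = demo_hash_str(nspace);
            jobid &= ~(0x8000);
        }
        return jobid;
    }

    int main(void)
    {
        /* Gilles' proposed fix, modelled here at the call site: set the flag
         * before PMIx init so the convert path is taken instead of the hash
         * path (the actual proposal puts this in ess/singleton) */
        putenv(DEMO_MCA_PREFIX "orte_launch=1");

        printf("jobid = %u\n", (unsigned)demo_pick_jobid("611123201"));
        return 0;
    }

With the putenv() in place the getenv() branch is taken and the jobid is recovered exactly from the nspace; without it, two different nspaces can hash to the same jobid, which matches the duplicate-identifier error reported below.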
> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> i think i just found the root cause :-)
>
> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
>
>     /* store our jobid and rank */
>     if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>         /* if we were launched by the OMPI RTE, then
>          * the jobid is in a special format - so get it */
>         mca_pmix_pmix112_component.native_launch = true;
>         opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
>     } else {
>         /* we were launched by someone else, so make the
>          * jobid just be the hash of the nspace */
>         OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>         /* keep it from being negative */
>         pname.jobid &= ~(0x8000);
>     }
>
> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
> so we end up using a jobid that is a hash of the namespace
>
> the initial error message was
>
>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>     unexpected process identifier [[9325,0],0] from [[5590,0],0]
>
> and with little surprise, ((9325 << 16) + 1) and ((5590 << 16) + 1) have the
> same hash as calculated by OPAL_HASH_STR
>
> i am now thinking the right fix is to simply
>
>     putenv(OPAL_MCA_PREFIX"orte_launch=1");
>
> in ess/singleton
>
> makes sense ?
>
> Cheers,
>
> Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok,
>>
>> one test segfaulted *but* I can't tell if it is the *same* bug, because
>> there has been a segfault:
>>
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>
>> stdout:
>>
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.
>> This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127)
>>       instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>       instead of "Success" (0)
>> --------------------------------------------------------------------------
>>
>> openmpi content of $TMP:
>>
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>> ...
>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>>
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>> ./53310/contact.txt
>>
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>>
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>
>> (this is the faulty test)
>>
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>
>> Maybe it aborted (instead of giving the other message) while reporting the
>> error because of export OMPI_MCA_plm_base_verbose=5 ?
>>
>> Thanks,
>>
>> Eric
>>
>>
>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>> Eric,
>>>
>>> do you mean you have a unique $TMP per a.out ?
>>> or a unique $TMP per "batch" of runs ?
>>>
>>> in the first case, my understanding is that conflicts cannot happen ...
>>>
>>> once you hit the bug, can you please post the output of the failed a.out,
>>> and run
>>>
>>>     egrep 'jobfam|stop'
>>>
>>> on all your logs, so we might spot a conflict
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>>     Lucky!
>>>
>>>     Since each run has a specific TMP, I still have it on disc.
>>>
>>>     For the faulty run, the TMP variable was:
>>>
>>>     TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>>     and into $TMP I have:
>>>
>>>     openmpi-sessions-40031@lorien_0
>>>
>>>     and into this subdirectory I have a bunch of empty dirs:
>>>
>>>     cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>     1841
>>>
>>>     cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>     total 68
>>>     drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>     drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>     drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>     drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>     ...
>>>
>>>     If I do:
>>>
>>>     lsof | grep "openmpi-sessions-40031"
>>>     lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>           Output information may be incomplete.
>>>     lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>           Output information may be incomplete.
>>>
>>>     nothing...
>>>
>>>     What else may I check?
>>>
>>>     Eric
>>>
>>>
>>>     On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>
>>>         Hi, Eric
>>>
>>>         I **think** this might be related to the following:
>>>
>>>         https://github.com/pmix/master/pull/145
>>>
>>>         I'm wondering if you can look into the /tmp directory and see if
>>>         you have a bunch of stale usock files.
>>>
>>>         Best,
>>>
>>>         Josh
>>>
>>>
>>>         On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>         <gil...@rist.or.jp> wrote:
>>>
>>>             Eric,
>>>
>>>             can you please provide more information on how your tests are launched ?
>>>
>>>             do you
>>>
>>>                 mpirun -np 1 ./a.out
>>>
>>>             or do you simply
>>>
>>>                 ./a.out
>>>
>>>             do you use a batch manager ? if yes, which one ?
>>>
>>>             do you run one test per job ? or multiple tests per job ?
>>>
>>>             how are these tests launched ?
>>>
>>>             does the test that crashes use MPI_Comm_spawn ?
>>>
>>>             i am surprised by the process name [[9325,5754],0], which suggests
>>>             MPI_Comm_spawn was called 5753 times (!)
>>>
>>>             can you also run
>>>
>>>                 hostname
>>>
>>>             on the 'lorien' host ?
>>>
>>>             if you configure'd Open MPI with --enable-debug, can you
>>>
>>>                 export OMPI_MCA_plm_base_verbose=5
>>>
>>>             then run one test and post the logs ?
>>>
>>>             from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>             produce job family 5576 (but you get 9325)
>>>
>>>             the discrepancy could be explained by the use of a batch manager
>>>             and/or a full hostname i am unaware of.
>>>
>>>             orte_plm_base_set_hnp_name() generates a 16 bit job family from the
>>>             (32 bit hash of the) hostname and the mpirun (32 bit ?) pid.
>>>
>>>             so strictly speaking, it is possible that two jobs launched on the
>>>             same node are assigned the same 16 bit job family.
>>>
>>>             the easiest way to detect this could be to
>>>
>>>             - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>
>>>                   OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                                        "plm:base:set_hnp_name: final jobfam %lu",
>>>                                        (unsigned long)jobfam));
>>>
>>>               with
>>>
>>>                   OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                                        "plm:base:set_hnp_name: final jobfam %lu",
>>>                                        (unsigned long)jobfam));
>>>
>>>             - configure Open MPI with --enable-debug and rebuild
>>>
>>>             - then
>>>
>>>                   export OMPI_MCA_plm_base_verbose=4
>>>
>>>               and run your tests.
>>>
>>>             when the problem occurs, you will be able to check which pids
>>>             produced the faulty jobfam, and that could hint at a conflict.
>>>
>>>             Cheers,
>>>
>>>             Gilles
>>>
>>>
>>>             On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>>                 Hi,
>>>
>>>                 It is the third time this has happened in the last 10 days.
>>>
>>>                 While running nightly tests (~2200), we have one or two tests
>>>                 that fail at the very beginning with this strange error:
>>>
>>>                 [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>                 received unexpected process identifier [[9325,0],0] from
>>>                 [[5590,0],0]
>>>
>>>                 But I can't reproduce the problem right now... i.e. if I launch
>>>                 this test alone "by hand", it is successful... the same test
>>>                 was successful yesterday...
>>>
>>>                 Is there some kind of "race condition" that can happen on the
>>>                 creation of "tmp" files if many tests run together on the same
>>>                 node? (we are oversubscribing even sequential runs...)
>>>
>>>                 Here are the build logs:
>>>
>>>                 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>>                 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>>                 Thanks,
>>>
>>>                 Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
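A closing note on the jobfam side of the discussion above: the sketch below is only an illustrative model of the collision Gilles describes (the hash and the fold are stand-ins, not the real orte_plm_base_set_hnp_name() arithmetic), but it shows why truncating a hostname-hash/pid combination to 16 bits leaves room for two jobs on the same node to draw the same job family.

    /* Editor's sketch only: NOT the exact orte_plm_base_set_hnp_name()
     * arithmetic, just a model of the failure mode discussed above.  A 16 bit
     * job family derived from a hostname hash and a pid has only 65536
     * possible values, so concurrent jobs on one node can collide. */
    #include <stdio.h>
    #include <stdint.h>

    /* stand-in 32-bit string hash (djb2-style) */
    static uint32_t demo_hash_str(const char *s)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (unsigned char)*s++;
        }
        return h;
    }

    /* model: fold the hostname hash and the pid together, keep the low 16 bits */
    static uint16_t demo_jobfam(const char *hostname, uint32_t pid)
    {
        return (uint16_t)((demo_hash_str(hostname) ^ pid) & 0xffff);
    }

    int main(void)
    {
        const char *host = "lorien";
        uint32_t pid_a = 142766;           /* pid seen in the logs above */
        uint32_t pid_b = pid_a + 0x10000;  /* differs only above bit 15 */

        /* both calls print the same job family in this model */
        printf("jobfam(%s, %u) = %u\n", host, (unsigned)pid_a,
               (unsigned)demo_jobfam(host, pid_a));
        printf("jobfam(%s, %u) = %u\n", host, (unsigned)pid_b,
               (unsigned)demo_jobfam(host, pid_b));
        return 0;
    }

This is also why the suggested OMPI_MCA_plm_base_verbose=4 run is useful: logging every final jobfam together with the pid that produced it makes it straightforward to spot two processes that drew the same 16-bit value.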