Great catch, Gilles! Not much of a surprise though. Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P
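
For anyone who wants to see how little it takes: below is a minimal sketch of the kind of collision involved. It is NOT the actual orte_plm_base_set_hnp_name() code (the real hash and mixing differ); hash_name() and job_family() are invented stand-ins, and only the host "lorien" and pid 190552 are taken from Eric's log. The point is simply that once a hostname hash and a pid are folded into a 16-bit job family, two singletons started on the same node can end up with the same jobfam:

    /* Illustrative sketch only, NOT the real orte_plm_base_set_hnp_name()
     * implementation: hash_name() and the mixing below are invented for the
     * example.  It just shows that folding a 32-bit hostname hash and a pid
     * into a 16-bit job family leaves plenty of room for collisions. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash_name(const char *s)   /* hypothetical string hash */
    {
        uint32_t h = 5381;
        while (*s) h = h * 33u + (unsigned char)*s++;
        return h;
    }

    static uint16_t job_family(const char *nodename, unsigned pid)
    {
        /* mix the hostname hash with the pid, keep only 16 bits */
        return (uint16_t)((hash_name(nodename) ^ pid) & 0xffffu);
    }

    int main(void)
    {
        /* look for another pid on the same host that maps to the same jobfam
         * as Eric's failing singleton (pid 190552 on "lorien") */
        const uint16_t target = job_family("lorien", 190552);
        for (unsigned pid = 1; pid < 200000; pid++) {
            if (pid != 190552 && job_family("lorien", pid) == target) {
                printf("pid %u collides with pid 190552 (jobfam %u)\n",
                       pid, (unsigned)target);
                break;
            }
        }
        return 0;
    }

If job families behaved like uniform random 16-bit values, a nightly batch of ~2200 singleton runs drawing from only 65536 possibilities would repeat at least one value with probability well above 99% (birthday bound), so an occasional repeat is expected rather than surprising; a repeat alone is not fatal, but it sets up exactly the kind of mix-up Eric is seeing.
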
Josh

On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Eric,
>
> a bug has been identified, and a patch is available at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
> so if applying a patch does not fit your test workflow, it might be easier
> for you to update it and run mpirun -np 1 ./a.out instead of ./a.out
>
> basically, increasing verbosity runs some extra code, which includes sprintf.
> so yes, increasing verbosity can crash an app by running into a bug that is
> hidden under normal operation.
> my intuition suggests this is quite unlikely ... if you can get a core file
> and a backtrace, we will soon find out
>
> Cheers,
>
> Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>
>> Ok,
>>
>> one test segfaulted, *but* I can't tell if it is the *same* bug, because
>> this time the failure was a segfault:
>>
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>
>> stdout:
>>
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127)
>>       instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment problems.
>> This failure appears to be an internal failure; here's some additional
>> information (which may only be relevant to an Open MPI developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>       instead of "Success" (0)
>> --------------------------------------------------------------------------
>>
>> openmpi content of $TMP:
>>
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx------ 3 cmpbib bib 250 Sep 14 13:34 .
>> drwxrwxrwt 356 root root 61440 Sep 14 13:45 ..
>> ...
>> drwx------ 1848 cmpbib bib 45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>> srw-rw-r-- 1 cmpbib bib 0 Sep 14 12:24 pmix-190552
>>
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>> ./53310/contact.txt
>>
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>>
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>
>> (this is the faulty test)
>>
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>
>> Maybe it aborted (instead of printing the other message) while reporting
>> the error, because of export OMPI_MCA_plm_base_verbose=5 ?
>>
>> Thanks,
>>
>> Eric
>>
>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>
>>> Eric,
>>>
>>> do you mean you have a unique $TMP per a.out ?
>>> or a unique $TMP per "batch" of runs ?
>>>
>>> in the first case, my understanding is that conflicts cannot happen ...
>>>
>>> once you hit the bug, can you please post the output of the failed a.out,
>>> and run
>>>     egrep 'jobfam|stop'
>>> on all your logs, so we might spot a conflict
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>> Lucky!
>>>
>>> Since each run has a specific TMP, I still have it on disk.
>>>
>>> For the faulty run, the TMP variable was:
>>>
>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>> and in $TMP I have:
>>>
>>> openmpi-sessions-40031@lorien_0
>>>
>>> and in this subdirectory I have a bunch of empty dirs:
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>> 1841
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>> total 68
>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>> ...
>>>
>>> If I do:
>>>
>>> lsof |grep "openmpi-sessions-40031"
>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>       Output information may be incomplete.
>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>       Output information may be incomplete.
>>>
>>> nothing...
>>>
>>> What else may I check?
>>>
>>> Eric
>>>
>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>
>>> Hi, Eric
>>>
>>> I **think** this might be related to the following:
>>>
>>> https://github.com/pmix/master/pull/145
>>>
>>> I'm wondering if you can look into the /tmp directory and see if you
>>> have a bunch of stale usock files.
>>>
>>> Best,
>>>
>>> Josh
>>>
>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>> <gil...@rist.or.jp> wrote:
>>>
>>> Eric,
>>>
>>> can you please provide more information on how your tests are launched ?
>>>
>>> do you
>>>     mpirun -np 1 ./a.out
>>> or do you simply
>>>     ./a.out
>>>
>>> do you use a batch manager ? if yes, which one ?
>>> do you run one test per job ? or multiple tests per job ?
>>> how are these tests launched ?
>>>
>>> does the test that crashes use MPI_Comm_spawn ?
>>> I am surprised by the process name [[9325,5754],0], which suggests
>>> MPI_Comm_spawn was called 5753 times (!)
>>>
>>> can you also run
>>>     hostname
>>> on the 'lorien' host ?
>>>
>>> if you configured Open MPI with --enable-debug, can you
>>>     export OMPI_MCA_plm_base_verbose=5
>>> then run one test and post the logs ?
>>>
>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>> produce job family 5576 (but you get 9325)
>>> the discrepancy could be explained by the use of a batch manager
>>> and/or a full hostname I am unaware of.
>>>
>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>> (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
>>> so strictly speaking, it is possible two jobs launched on the same
>>> node are assigned the same 16-bit job family.
>>>
>>> the easiest way to detect this could be to
>>>
>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>
>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>>   with
>>>
>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>> - configure Open MPI with --enable-debug and rebuild
>>>
>>> - then
>>>     export OMPI_MCA_plm_base_verbose=4
>>>   and run your tests.
>>>
>>> when the problem occurs, you will be able to check which pids
>>> produced the faulty jobfam, and that could hint at a conflict.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>> Hi,
>>>
>>> This is the third time this has happened in the last 10 days.
>>>
>>> While running nightly tests (~2200), we have one or two tests that
>>> fail at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>> received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right now... i.e. if I launch this
>>> test alone "by hand", it is successful... the same test was
>>> successful yesterday...
>>>
>>> Is there some kind of "race condition" that can happen on the
>>> creation of "tmp" files if many tests run together on the same node?
>>> (we are oversubscribing even sequential runs...)
>>>
>>> Here are the build logs:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>> Thanks,
>>>
>>> Eric
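
P.S. In case it helps with combing a night's worth of logs once the verbose output Gilles describes above is in place: here is a small, hypothetical helper (call it dupjobfam.c; the name and the approach are mine, nothing shipped with Open MPI) that reads the Cerr logs on stdin and flags any 16-bit job family that was assigned more than once. A duplicate alone does not prove a clash, but it narrows down which pids to inspect.

    // Hypothetical helper (not part of Open MPI, just a sketch): feed the
    // nightly Cerr logs on stdin and it prints every "final jobfam" value
    // that shows up in more than one run, e.g.
    //     cat dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt | ./dupjobfam
    // (in practice, cat all the logs of one nightly batch together).
    // Pids sharing a jobfam are the candidates for the conflict above.
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static unsigned count[65536];   // job families fit in 16 bits
        char line[4096];

        while (fgets(line, sizeof line, stdin)) {
            const char *tag = "plm:base:set_hnp_name: final jobfam ";
            const char *p = strstr(line, tag);
            unsigned long fam;
            if (p && sscanf(p + strlen(tag), "%lu", &fam) == 1 && fam < 65536) {
                if (++count[fam] == 2)  // report each duplicated value once
                    printf("jobfam %lu was assigned more than once\n", fam);
            }
        }
        return 0;
    }
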
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel