Hi, Can I please be removed from this list?
Thanks,
Jeremy

On Thu, Sep 15, 2016 at 8:44 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> I don’t think a collision was the issue here. We were taking the mpirun-generated jobid and passing it thru the hash, thus creating an incorrect and invalid value. What I’m more surprised by is that it doesn’t -always- fail. The only thing I can figure is that, unlike with PMIx, the usock oob component doesn’t check the incoming identifier of the connecting proc to see if it is someone it knows. So unless you just happened to hash into a daemon jobid form, it would accept the connection (even though the name wasn’t correct).
>
> I think this should fix the issue. Let’s wait and see.
>
> On Sep 15, 2016, at 4:47 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> I just realized I screwed up my test, and I was missing some relevant info...
> So on one hand, I fixed a bug in singleton,
> but on the other hand, I cannot tell whether a collision was involved in this issue.
>
> Cheers,
>
> Gilles
>
> Joshua Ladd <jladd.m...@gmail.com> wrote:
> Great catch, Gilles! Not much of a surprise though.
>
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P
>
> Josh
>
> On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Eric,
>>
>> a bug has been identified, and a patch is available at
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>
>> The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow,
>> it might be easier for you to update it and run mpirun -np 1 ./a.out instead of ./a.out.
>>
>> Basically, increasing verbosity runs some extra code, which includes sprintf.
>> So yes, it is possible to crash an app by increasing verbosity, by running into a bug that is hidden under normal operation.
>> My intuition suggests this is quite unlikely ... if you can get a core file and a backtrace, we will soon find out.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>
>>> Ok,
>>>
>>> one test segfaulted *but* I can't tell if it is the *same* bug because there has been a segfault:
>>>
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>
>>> stdout:
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>>   orte_ess_init failed
>>>   --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>>   ompi_mpi_init: ompi_rte_init failed
>>>   --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>>
>>> openmpi content of $TMP:
>>>
>>> /tmp/tmp.GoQXICeyJl> ls -la
>>> total 1500
>>> drwx------ 3 cmpbib bib 250 Sep 14 13:34 .
>>> drwxrwxrwt 356 root root 61440 Sep 14 13:45 ..
>>> ...
>>> drwx------ 1848 cmpbib bib 45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>>> srw-rw-r-- 1 cmpbib bib 0 Sep 14 12:24 pmix-190552
>>>
>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>>> ./53310/contact.txt
>>>
>>> cat 53310/contact.txt
>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>> 190552
>>>
>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>
>>> (this is the faulty test)
>>>
>>> full egrep:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>
>>> config.log:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>
>>> ompi_info:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>
>>> Maybe it aborted (instead of giving the other message) while reporting the error because of export OMPI_MCA_plm_base_verbose=5?
>>>
>>> Thanks,
>>>
>>> Eric
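>>>
>>> (For reference, a quick sketch of the name layout above: an ORTE process name prints as [[jobfam,local],vpid], and the 32-bit jobid stored in contact.txt appears to carry the 16-bit job family in its upper half, which is why "3493724160.0" and "final jobfam 53310" refer to the same job. The helper names below are made up for illustration; they are not the actual ORTE macros.)
>>>
>>> #include <stdint.h>
>>> #include <stdio.h>
>>>
>>> /* hypothetical helpers, mirroring the jobfam-in-upper-16-bits layout */
>>> static uint16_t job_family_sketch(uint32_t jobid) { return (uint16_t)(jobid >> 16); }
>>> static uint16_t local_job_sketch(uint32_t jobid)  { return (uint16_t)(jobid & 0xffffu); }
>>>
>>> int main(void)
>>> {
>>>     uint32_t jobid = 3493724160u;  /* from 53310/contact.txt: "3493724160.0" */
>>>     /* prints [[53310,0],0], the same family as "plm:base:set_hnp_name: final jobfam 53310" */
>>>     printf("[[%u,%u],%u]\n", (unsigned)job_family_sketch(jobid),
>>>            (unsigned)local_job_sketch(jobid), 0u);
>>>     return 0;
>>> }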
>>>
>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>
>>>> Eric,
>>>>
>>>> do you mean you have a unique $TMP per a.out?
>>>> or a unique $TMP per "batch" of runs?
>>>>
>>>> In the first case, my understanding is that conflicts cannot happen ...
>>>>
>>>> Once you hit the bug, can you please post the output of the failed a.out,
>>>> and run
>>>> egrep 'jobfam|stop'
>>>> on all your logs, so we might spot a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>
>>>> Lucky!
>>>>
>>>> Since each run has a specific TMP, I still have it on disk.
>>>>
>>>> For the faulty run, the TMP variable was:
>>>>
>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>
>>>> and into $TMP I have:
>>>>
>>>> openmpi-sessions-40031@lorien_0
>>>>
>>>> and into this subdirectory I have a bunch of empty dirs:
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>> 1841
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>> total 68
>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>>> ...
>>>>
>>>> If I do:
>>>>
>>>> lsof | grep "openmpi-sessions-40031"
>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>> Output information may be incomplete.
>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>> Output information may be incomplete.
>>>>
>>>> nothing...
>>>>
>>>> What else may I check?
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>
>>>> Hi, Eric
>>>>
>>>> I **think** this might be related to the following:
>>>> https://github.com/pmix/master/pull/145
>>>>
>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>
>>>> Best,
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>> Eric,
>>>>
>>>> can you please provide more information on how your tests are launched?
>>>>
>>>> Do you
>>>> mpirun -np 1 ./a.out
>>>> or do you simply
>>>> ./a.out
>>>>
>>>> Do you use a batch manager? If yes, which one?
>>>> Do you run one test per job, or multiple tests per job?
>>>> How are these tests launched?
>>>>
>>>> Does the test that crashes use MPI_Comm_spawn?
>>>> I am surprised by the process name [[9325,5754],0], which suggests that MPI_Comm_spawn was called 5753 times (!)
>>>>
>>>> Can you also run
>>>> hostname
>>>> on the 'lorien' host?
>>>>
>>>> If you configured Open MPI with --enable-debug, can you
>>>> export OMPI_MCA_plm_base_verbose=5
>>>> then run one test and post the logs?
>>>>
>>>> From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325).
>>>> The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.
>>>>
>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>> So strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.
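>>>>
>>>> As a rough illustration only (made-up hash and folding, not the actual orte_plm_base_set_hnp_name() code), squeezing a 32-bit hostname hash and a pid into 16 bits leaves only 65536 possible job families, so two different launches can collide:
>>>>
>>>> #include <stdint.h>
>>>> #include <stdio.h>
>>>>
>>>> /* made-up string hash, standing in for the real hostname hash */
>>>> static uint32_t hash_sketch(const char *s)
>>>> {
>>>>     uint32_t h = 5381u;
>>>>     while (*s != '\0') {
>>>>         h = (h * 33u) ^ (uint32_t)(unsigned char)*s++;
>>>>     }
>>>>     return h;
>>>> }
>>>>
>>>> /* fold hostname hash and pid down to 16 bits */
>>>> static uint16_t jobfam_sketch(const char *nodename, uint32_t pid)
>>>> {
>>>>     return (uint16_t)((hash_sketch(nodename) ^ pid) & 0xffffu);
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     /* two pids on the same node that differ only above bit 15
>>>>        get the same 16-bit job family with this folding */
>>>>     printf("%u\n", (unsigned)jobfam_sketch("lorien", 190552u));
>>>>     printf("%u\n", (unsigned)jobfam_sketch("lorien", 190552u + 0x10000u));
>>>>     return 0;
>>>> }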
>>>>
>>>> The easiest way to detect this could be to
>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>> and replace
>>>>
>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>>
>>>> with
>>>>
>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>>
>>>> configure Open MPI with --enable-debug and rebuild,
>>>> and then
>>>>
>>>>     export OMPI_MCA_plm_base_verbose=4
>>>>
>>>> and run your tests.
>>>>
>>>> When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
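>>>>
>>>> (A rough sketch of how such verbosity-gated output behaves, with hypothetical macro and variable names rather than the actual OPAL_OUTPUT_VERBOSE implementation: the formatting only runs when the component's verbose level reaches the level given at the call site, which is why a level-5 message stays silent under verbose=4, and why raising verbosity executes extra code, sprintf included, that is skipped in normal runs.)
>>>>
>>>> #include <stdio.h>
>>>>
>>>> static int plm_verbose_level = 0;   /* e.g. set from OMPI_MCA_plm_base_verbose */
>>>>
>>>> #define VERBOSE_SKETCH(level, ...)                    \
>>>>     do {                                              \
>>>>         if (plm_verbose_level >= (level)) {           \
>>>>             fprintf(stderr, __VA_ARGS__);             \
>>>>             fputc('\n', stderr);                      \
>>>>         }                                             \
>>>>     } while (0)
>>>>
>>>> int main(void)
>>>> {
>>>>     plm_verbose_level = 4;
>>>>     VERBOSE_SKETCH(4, "plm:base:set_hnp_name: final jobfam %lu",
>>>>                    (unsigned long)53310);             /* printed at verbose=4 */
>>>>     VERBOSE_SKETCH(5, "only visible at verbose >= 5"); /* skipped */
>>>>     return 0;
>>>> }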
>>>>
>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>
>>>> Hi,
>>>>
>>>> It is the third time this has happened in the last 10 days.
>>>>
>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>
>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>
>>>> But I can't reproduce the problem right now... i.e.: if I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>
>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)
>>>>
>>>> Here are the build logs:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>
>>>> Thanks,
>>>>
>>>> Eric

--
JM
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel