It’s okay - it was just confusing

This actually wound up having nothing to do with how the jobid is generated.
The root cause of the problem was that we took an mpirun-generated jobid, and
then mistakenly passed it back through a hash function instead of just using it.
So we hashed a perfectly good jobid.
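Just to make the pattern concrete, the mistake was roughly of this shape (a made-up minimal sketch, not the actual OMPI code - the names and values here are invented for illustration):

    /* hypothetical sketch of the bug pattern, not the real OMPI code */
    #include <stdint.h>
    #include <stdio.h>

    /* stand-in for the kind of hash we should NOT have applied */
    static uint32_t toy_hash(uint32_t v)
    {
        return v * 2654435761u;   /* Knuth-style multiplicative hash */
    }

    int main(void)
    {
        /* jobid handed down by mpirun (example value, illustration only) */
        uint32_t given_jobid = 3493724160u;

        uint32_t used  = given_jobid;            /* right: just use it */
        uint32_t wrong = toy_hash(given_jobid);  /* what the buggy path did */

        printf("given %lu, used %lu, re-hashed %lu\n",
               (unsigned long)given_jobid, (unsigned long)used,
               (unsigned long)wrong);
        return 0;
    }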

What is puzzling is how it could ever have worked - yet the user said it only
occasionally messed things up enough to cause breakage. You would think that
hashing a valid jobid would always create an unusable mess, but that doesn't
appear to be the case.

<shrug> probably indicative of the weakness of the hash :-)


> On Sep 15, 2016, at 8:34 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> 
> Ralph,
> 
> We love PMIx :). In this context, when I say PMIx, I am referring to the PMIx 
> framework in OMPI/OPAL, not the standalone PMIx library. Sorry that wasn't 
> clear.
> 
> Josh 
> 
> On Thu, Sep 15, 2016 at 10:07 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> I don’t understand this fascination with PMIx. PMIx didn’t calculate this 
> jobid - OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to 
> do with PMIx.
> 
> So why do you want to continue to blame PMIx for this problem??
> 
> 
>> On Sep 15, 2016, at 4:29 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>> 
>> Great catch, Gilles! Not much of a surprise though. 
>> 
>> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the 
>> jobid, which, in this case, results in hash collisions. ;-P
>> 
>> Josh
>> 
>> On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> Eric,
>> 
>> 
>> a bug has been identified, and a patch is available at
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>> 
>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
>> so if applying the patch does not fit your test workflow, it might be easier
>> for you to update it and run mpirun -np 1 ./a.out instead of ./a.out
>> 
>> 
>> basically, increasing verbosity runs some extra code, which includes sprintf.
>> so yes, it is possible to crash an app by increasing verbosity and running
>> into a bug that is hidden under normal operation.
>> my intuition suggests this is quite unlikely ... if you can get a core file
>> and a backtrace, we will soon find out
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok,
>> 
>> one test segfaulted *but* I can't tell whether it is the *same* bug, because
>> this time there was a segfault:
>> 
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>> 
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> 
>> stdout:
>> 
>> -------------------------------------------------------------------------- 
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127) 
>> instead of ORTE_SUCCESS
>> -------------------------------------------------------------------------- 
>> -------------------------------------------------------------------------- 
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127) instead 
>> of "Success" (0)
>> -------------------------------------------------------------------------- 
>> 
>> openmpi content of $TMP:
>> 
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>> ...
>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>> 
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>> ./53310/contact.txt
>> 
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>> 
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> 
>> (this is the faulty test)
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>> 
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>> 
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>> 
>> Maybe it aborted (instead of printing the other message) while handling the
>> error, because of export OMPI_MCA_plm_base_verbose=5 ?
>> 
>> Thanks,
>> 
>> Eric
>> 
>> 
>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>> Eric,
>> 
>> do you mean you have a unique $TMP per a.out ?
>> or a unique $TMP per "batch" of runs ?
>> 
>> in the first case, my understanding is that conflicts cannot happen ...
>> 
>> once you hit the bug, can you please post the output of the failed a.out,
>> and run
>> egrep 'jobfam|stop'
>> on all your logs, so we might spot a conflict
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wednesday, September 14, 2016, Eric Chamberland
>> <eric.chamberl...@giref.ulaval.ca> wrote:
>> 
>>     Lucky!
>> 
>>     Since each run has a specific TMP, I still have it on disc.
>> 
>>     for the faulty run, the TMP variable was:
>> 
>>     TMP=/tmp/tmp.wOv5dkNaSI
>> 
>>     and into $TMP I have:
>> 
>>     openmpi-sessions-40031@lorien_0
>> 
>>     and into this subdirectory I have a bunch of empty dirs:
>> 
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>     1841
>> 
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>     total 68
>>     drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>     drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>     drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>     drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>     drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>     drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>     ...
>> 
>>     If I do:
>> 
>>     lsof |grep "openmpi-sessions-40031"
>>     lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>>     /run/user/1000/gvfs
>>           Output information may be incomplete.
>>     lsof: WARNING: can't stat() tracefs file system
>>     /sys/kernel/debug/tracing
>>           Output information may be incomplete.
>> 
>>     nothing...
>> 
>>     What else may I check?
>> 
>>     Eric
>> 
>> 
>>     On 14/09/16 08:47 AM, Joshua Ladd wrote:
>> 
>>         Hi, Eric
>> 
>>         I **think** this might be related to the following:
>> 
>>         https://github.com/pmix/master/pull/145
>> 
>>         I'm wondering if you can look into the /tmp directory and see if you
>>         have a bunch of stale usock files.
>> 
>>         Best,
>> 
>>         Josh
>> 
>> 
>>         On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>         <gil...@rist.or.jp> wrote:
>> 
>>             Eric,
>> 
>> 
>>             can you please provide more information on how your tests
>>         are launched ?
>> 
>>             do you
>> 
>>             mpirun -np 1 ./a.out
>> 
>>             or do you simply
>> 
>>             ./a.out
>> 
>> 
>>             do you use a batch manager ? if yes, which one ?
>> 
>>             do you run one test per job ? or multiple tests per job ?
>> 
>>             how are these tests launched ?
>> 
>> 
>>             does the test that crashes use MPI_Comm_spawn ?
>> 
>>             i am surprised by the process name [[9325,5754],0], which
>>             suggests that MPI_Comm_spawn was called 5753 times (!)
>> 
>> 
>>             can you also run
>> 
>>             hostname
>> 
>>             on the 'lorien' host ?
>> 
>>             if you configured Open MPI with --enable-debug, can you
>> 
>>             export OMPI_MCA_plm_base_verbose=5
>> 
>>             then run one test and post the logs ?
>> 
>> 
>>             from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
>>         should
>>             produce job family 5576 (but you get 9325)
>> 
>>             the discrepancy could be explained by the use of a batch manager
>>             and/or a full hostname i am unaware of.
>> 
>> 
>>             orte_plm_base_set_hnp_name() generates a 16-bit job family from
>>             the (32-bit hash of the) hostname and the mpirun (32 bits ?) pid.
>> 
>>             so strictly speaking, it is possible that two jobs launched on
>>             the same node are assigned the same 16-bit job family.
>> 
>> 
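>>             just to illustrate that truncation, here is a toy sketch
>>             (made-up code, not the actual orte_plm_base_set_hnp_name()
>>             implementation):
>> 
>>                 /* toy sketch: 16-bit job family from hostname hash + pid */
>>                 #include <stdint.h>
>>                 #include <stdio.h>
>> 
>>                 static uint32_t toy_hash32(const char *s)
>>                 {
>>                     uint32_t h = 0;
>>                     while (*s != '\0') {
>>                         h = h * 31u + (uint8_t)*s++;
>>                     }
>>                     return h;
>>                 }
>> 
>>                 static uint16_t toy_jobfam(const char *node, uint32_t pid)
>>                 {
>>                     /* only 16 bits survive, so collisions are possible */
>>                     return (uint16_t)((toy_hash32(node) + pid) & 0xffff);
>>                 }
>> 
>>                 int main(void)
>>                 {
>>                     /* two different pids on the same node, same job family */
>>                     printf("%lu\n", (unsigned long)toy_jobfam("lorien", 1234));
>>                     printf("%lu\n", (unsigned long)toy_jobfam("lorien", 1234 + 65536));
>>                     return 0;
>>                 }
>> 
>>             in this toy version, any two pids that differ by a multiple of
>>             65536 collide; the real hash is different, but the 16-bit
>>             truncation is the point.
>> 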
>>             the easiest way to detect this could be to
>> 
>>             - edit orte/mca/plm/base/plm_base_jobid.c
>> 
>>             and replace
>> 
>>                 OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>                                      "plm:base:set_hnp_name: final jobfam %lu",
>>                                      (unsigned long)jobfam));
>> 
>>             with
>> 
>>                 OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>                                      "plm:base:set_hnp_name: final jobfam %lu",
>>                                      (unsigned long)jobfam));
>> 
>>             configure Open MPI with --enable-debug and rebuild
>> 
>>             and then
>> 
>>             export OMPI_MCA_plm_base_verbose=4
>> 
>>             and run your tests.
>> 
>> 
>>             when the problem occurs, you will be able to check which pids
>>             produced the faulty jobfam, and that could hint at a conflict.
>> 
>> 
>>             Cheers,
>> 
>> 
>>             Gilles
>> 
>> 
>> 
>>             On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>> 
>>                 Hi,
>> 
>>                 It is the third time this has happened in the last 10 days.
>> 
>>                 While running nightly tests (~2200), we have one or two
>>                 tests that fail at the very beginning with this strange error:
>> 
>>                 [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>                 received unexpected process identifier [[9325,0],0] from
>>                 [[5590,0],0]
>> 
>>                 But I can't reproduce the problem right now... i.e., if I
>>                 launch this test alone "by hand", it is successful... the
>>                 same test was successful yesterday...
>> 
>>                 Is there some kind of "race condition" that can happen on
>>                 the creation of "tmp" files if many tests run together on
>>                 the same node? (we are oversubscribing even sequential runs...)
>> 
>>                 Here are the build logs:
>> 
>> 
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> 
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>> 
>> 
>>                 Thanks,
>> 
>>                 Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
