I don’t understand this fascination with PMIx. PMIx didn’t calculate this jobid 
- OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to do with 
PMIx.

So why do you want to continue to blame PMIx for this problem??


> On Sep 15, 2016, at 4:29 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> 
> Great catch, Gilles! Not much of a surprise though. 
> 
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the 
> jobid, which, in this case, results in hash collisions. ;-P
> 
> Josh
> 
> On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Eric,
> 
> 
> a bug has been identified, and a patch is available at 
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
> 
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), 
> so if applying a patch does not fit your test workflow,
> 
> it might be easier for you to update your tests to run mpirun -np 1 ./a.out 
> instead of ./a.out
> 
> 
> basically, increasing verbosity runs some extra code, which includes sprintf.
> so yes, it is possible to crash an app by increasing verbosity and running 
> into a bug that is hidden under normal operation.
> my intuition suggests this is quite unlikely ... if you can get a core file 
> and a backtrace, we will soon find out
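> 
> as a contrived sketch (not the actual bug; the buffer size and message below 
> are made up for illustration), this is the kind of failure mode: an sprintf 
> overflow that silently corrupts heap metadata and is only detected later, 
> e.g. when a realloc walks the damaged heap:
> 
>     /* contrived example, NOT the real Open MPI code path */
>     #include <stdio.h>
>     #include <stdlib.h>
> 
>     int main(void) {
>         char *buf = malloc(8);
>         if (buf == NULL) return 1;
>         /* ~42 bytes written into an 8-byte buffer: silent overflow */
>         sprintf(buf, "plm:base:set_hnp_name: final jobfam %lu",
>                 (unsigned long)53310);
>         char *bigger = realloc(buf, 64);  /* glibc may abort here with
>                                              "realloc(): invalid next size" */
>         free(bigger);
>         return 0;
>     }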
> 
> 
> Cheers,
> 
> Gilles
> 
> 
> 
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
> Ok,
> 
> one test segfaulted, *but* I can't tell if it is the *same* bug, because this 
> time the failure was a segfault:
> 
> stderr:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
> 
> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
> NULL
> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
> 1366255883
> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:190552] [[53310,0],0] plm:base:receive start comm
> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
> ...
> ...
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
> on the local node in file ess_singleton_module.c at line 573
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
> on the local node in file ess_singleton_module.c at line 163
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [lorien:190306] Local abort before MPI_INIT completed completed successfully, 
> but am not able to aggregate error messages, and not able to guarantee that 
> all other processes were killed!
> 
> stdout:
> 
> -------------------------------------------------------------------------- 
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_init failed
>   --> Returned value Unable to start a daemon on the local node (-127) 
> instead of ORTE_SUCCESS
> -------------------------------------------------------------------------- 
> -------------------------------------------------------------------------- 
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Unable to start a daemon on the local node" (-127) instead of 
> "Success" (0)
> -------------------------------------------------------------------------- 
> 
> openmpi content of $TMP:
> 
> /tmp/tmp.GoQXICeyJl> ls -la
> total 1500
> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
> ...
> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 
> openmpi-sessions-40031@lorien_0
> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
> 
> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . 
> -type f
> ./53310/contact.txt
> 
> cat 53310/contact.txt
> 3493724160.0;usock;tcp://132.203.7.36:54605
> 190552
> 
> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] 
> plm:base:set_hnp_name: final jobfam 53310
> 
> (this is the faulty test)
> full egrep:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
> 
> config.log:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
> 
> ompi_info:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
> 
> Maybe it aborted (instead of printing the other message) while reporting the 
> error, because of export OMPI_MCA_plm_base_verbose=5 ?
> 
> Thanks,
> 
> Eric
> 
> 
> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
> Eric,
> 
> do you mean you have a unique $TMP per a.out ?
> or a unique $TMP per "batch" of runs ?
> 
> in the first case, my understanding is that conflicts cannot happen ...
> 
> once you hit the bug, can you please post the output of the
> failed a.out,
> and run
> egrep 'jobfam|stop'
> on all your logs, so we might spot a conflict
> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, September 14, 2016, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> 
>     Lucky!
> 
>     Since each run has a specific TMP, I still have it on disk.
> 
>     For the faulty run, the TMP variable was:
> 
>     TMP=/tmp/tmp.wOv5dkNaSI
> 
>     and in $TMP I have:
> 
>     openmpi-sessions-40031@lorien_0
> 
>     and in this subdirectory I have a bunch of empty dirs:
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
>     ls -la |wc -l
>     1841
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
>     ls -la |more
>     total 68
>     drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>     drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>     drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>     drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>     drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>     drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>     ...
> 
>     If I do:
> 
>     lsof |grep "openmpi-sessions-40031"
>     lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>     /run/user/1000/gvfs
>           Output information may be incomplete.
>     lsof: WARNING: can't stat() tracefs file system
>     /sys/kernel/debug/tracing
>           Output information may be incomplete.
> 
>     nothing...
> 
>     What else may I check?
> 
>     Eric
> 
> 
>     On 14/09/16 08:47 AM, Joshua Ladd wrote:
> 
>         Hi, Eric
> 
>         I **think** this might be related to the following:
> 
>         https://github.com/pmix/master/pull/145
> 
>         I'm wondering if you can look into the /tmp directory and see if you
>         have a bunch of stale usock files.
> 
>         Best,
> 
>         Josh
> 
> 
>         On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>         <gil...@rist.or.jp> wrote:
> 
>             Eric,
> 
> 
>             can you please provide more information on how your tests
>         are launched ?
> 
>             do you
> 
>             mpirun -np 1 ./a.out
> 
>             or do you simply
> 
>             ./a.out
> 
> 
>             do you use a batch manager ? if yes, which one ?
> 
>             do you run one test per job ? or multiple tests per job ?
> 
>             how are these tests launched ?
> 
> 
>             does the test that crashes use MPI_Comm_spawn ?
> 
>             I am surprised by the process name [[9325,5754],0], which suggests
>             MPI_Comm_spawn was called 5753 times (!)
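> 
>             (a rough sketch, not the actual ORTE macros: the printed name is
>             [[job family, local jobid], vpid], with the 32 bits jobid packing
>             the job family in the upper 16 bits and the per-spawn local job
>             number in the lower 16 bits)
> 
>                 #include <stdint.h>
>                 #include <stdio.h>
> 
>                 int main(void) {
>                     /* decode a name like [[9325,5754],0] */
>                     uint32_t jobid  = ((uint32_t)9325 << 16) | 5754;
>                     uint16_t jobfam = (uint16_t)(jobid >> 16);    /* 9325 */
>                     uint16_t local  = (uint16_t)(jobid & 0xffff); /* 5754 */
>                     printf("[[%u,%u],0]\n", (unsigned)jobfam, (unsigned)local);
>                     return 0;
>                 }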
> 
> 
>             can you also run
> 
>             hostname
> 
>             on the 'lorien' host ?
> 
>             if you configure'd Open MPI with --enable-debug, can you
> 
>             export OMPI_MCA_plm_base_verbose=5
> 
>             then run one test and post the logs ?
> 
> 
>             from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>             produce job family 5576 (but you get 9325)
> 
>             the discrepancy could be explained by the use of a batch manager
>             and/or a full hostname I am unaware of.
> 
> 
>             orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>             (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
> 
>             so strictly speaking, it is possible that two jobs launched on the
>             same node are assigned the same 16-bit job family.
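> 
>             (a rough sketch of why: the hash below is made up and is not the one
>             orte_plm_base_set_hnp_name() really uses, but any fold of a 32 bits
>             hostname hash and a pid into 16 bits can collide by the pigeonhole
>             principle)
> 
>                 #include <stdint.h>
>                 #include <stdio.h>
> 
>                 /* illustrative djb2-style string hash */
>                 static uint32_t hash_str(const char *s) {
>                     uint32_t h = 5381;
>                     while (*s) h = h * 33u + (uint8_t)*s++;
>                     return h;
>                 }
> 
>                 /* hypothetical 16-bit job family: hostname hash folded with pid */
>                 static uint16_t jobfam(const char *host, uint32_t pid) {
>                     return (uint16_t)((hash_str(host) + pid) & 0xffff);
>                 }
> 
>                 int main(void) {
>                     /* same host, two pids 65536 apart: identical job family */
>                     printf("%u\n", (unsigned)jobfam("lorien", 190552u));
>                     printf("%u\n", (unsigned)jobfam("lorien", 190552u + 65536u));
>                     return 0;
>                 }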
> 
> 
>             the easiest way to detect this could be to
> 
>             - edit orte/mca/plm/base/plm_base_jobid.c
> 
>             and replace
> 
>                 OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                                      "plm:base:set_hnp_name: final jobfam %lu",
>                                      (unsigned long)jobfam));
> 
>             with
> 
>                 OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                                      "plm:base:set_hnp_name: final jobfam %lu",
>                                      (unsigned long)jobfam));
> 
>             configure Open MPI with --enable-debug and rebuild
> 
>             and then
> 
>             export OMPI_MCA_plm_base_verbose=4
> 
>             and run your tests.
> 
> 
>             when the problem occurs, you will be able to check which pids
>             produced the faulty jobfam, and that could hint at a conflict.
> 
> 
>             Cheers,
> 
> 
>             Gilles
> 
> 
> 
>             On 9/14/2016 12:35 AM, Eric Chamberland wrote:
> 
>                 Hi,
> 
>                 It is the third time this has happened in the last 10 days.
> 
>                 While running nightly tests (~2200), we have one or two tests
>                 that fail at the very beginning with this strange error:
> 
>                 [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>                 received unexpected process identifier [[9325,0],0] from
>                 [[5590,0],0]
> 
>                 But I can't reproduce the problem right now... i.e., if I launch
>                 this test alone "by hand", it is successful... the same test was
>                 successful yesterday...
> 
>                 Is there some kind of "race condition" that can happen on the
>                 creation of "tmp" files if many tests run together on the same
>                 node? (we are oversubscribing even sequential runs...)
> 
>                 Here are the build logs:
> 
> 
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
> 
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
> 
> 
>                 Thanks,
> 
>                 Eric
> 
> 
> 
> 

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
