Hi, Can I please be removed from this list?
Thanks,
Jeremy

On Thu, Sep 15, 2016 at 8:44 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> I don’t think a collision was the issue here. We were taking the mpirun-generated jobid and passing it thru the hash, thus creating an incorrect and invalid value. What I’m more surprised by is that it doesn’t -always- fail. The only thing I can figure is that, unlike with PMIx, the usock oob component doesn’t check the incoming identifier of the connecting proc to see if it is someone it knows. So unless you just happened to hash into a daemon jobid form, it would accept the connection (even though the name wasn’t correct).
>
> I think this should fix the issue. Let’s wait and see.
>
> On Sep 15, 2016, at 4:47 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> I just realized I screwed up my test, and I was missing some relevant info...
> So on one hand, I fixed a bug in singleton,
> but on the other hand, I cannot tell whether a collision was involved in this issue.
>
> Cheers,
>
> Gilles
>
> Joshua Ladd <jladd.m...@gmail.com> wrote:
> Great catch, Gilles! Not much of a surprise though.
>
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P
>
> Josh
>
> On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Eric,
>>
>> a bug has been identified, and a patch is available at
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>
>> The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow,
>> it might be easier for you to update it and run mpirun -np 1 ./a.out instead of ./a.out.
>>
>> Basically, increasing verbosity runs some extra code, which includes sprintf.
>> So yes, it is possible to crash an app by increasing verbosity, by running into a bug that is hidden under normal operation.
>> My intuition suggests this is quite unlikely ... if you can get a core file and a backtrace, we will soon find out.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>
>>> Ok,
>>>
>>> one test segfaulted *but* I can't tell if it is the *same* bug because there has been a segfault:
>>>
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>
>>> stdout:
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>>   orte_ess_init failed
>>>   --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>>   ompi_mpi_init: ompi_rte_init failed
>>>   --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>>
>>> openmpi content of $TMP:
>>>
>>> /tmp/tmp.GoQXICeyJl> ls -la
>>> total 1500
>>> drwx------ 3 cmpbib bib 250 Sep 14 13:34 .
>>> drwxrwxrwt 356 root root 61440 Sep 14 13:45 ..
>>> ...
>>> drwx------ 1848 cmpbib bib 45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>>> srw-rw-r-- 1 cmpbib bib 0 Sep 14 12:24 pmix-190552
>>>
>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>>> ./53310/contact.txt
>>>
>>> cat 53310/contact.txt
>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>> 190552
>>>
>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>
>>> (this is the faulty test)
>>>
>>> full egrep:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>
>>> config.log:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>
>>> ompi_info:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>
>>> Maybe it aborted (instead of giving the other message) while reporting the error because of export OMPI_MCA_plm_base_verbose=5?
>>>
>>> Thanks,
>>>
>>> Eric
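>>>
>>> (For reference, a quick sketch of the name layout above: an ORTE process name prints as [[jobfam,local],vpid], and the 32-bit jobid stored in contact.txt appears to carry the 16-bit job family in its upper half, which is why "3493724160.0" and "final jobfam 53310" refer to the same job. The helper names below are made up for illustration; they are not the actual ORTE macros.)
>>>
>>> #include <stdint.h>
>>> #include <stdio.h>
>>>
>>> /* hypothetical helpers, mirroring the jobfam-in-upper-16-bits layout */
>>> static uint16_t job_family_sketch(uint32_t jobid) { return (uint16_t)(jobid >> 16); }
>>> static uint16_t local_job_sketch(uint32_t jobid)  { return (uint16_t)(jobid & 0xffffu); }
>>>
>>> int main(void)
>>> {
>>>     uint32_t jobid = 3493724160u;  /* from 53310/contact.txt: "3493724160.0" */
>>>     /* prints [[53310,0],0], the same family as "plm:base:set_hnp_name: final jobfam 53310" */
>>>     printf("[[%u,%u],%u]\n", (unsigned)job_family_sketch(jobid),
>>>            (unsigned)local_job_sketch(jobid), 0u);
>>>     return 0;
>>> }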
>>>
>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>
>>>> Eric,
>>>>
>>>> do you mean you have a unique $TMP per a.out?
>>>> or a unique $TMP per "batch" of runs?
>>>>
>>>> In the first case, my understanding is that conflicts cannot happen ...
>>>>
>>>> Once you hit the bug, can you please post the output of the failed a.out,
>>>> and run
>>>> egrep 'jobfam|stop'
>>>> on all your logs, so we might spot a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>
>>>> Lucky!
>>>>
>>>> Since each run has a specific TMP, I still have it on disk.
>>>>
>>>> For the faulty run, the TMP variable was:
>>>>
>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>
>>>> and into $TMP I have:
>>>>
>>>> openmpi-sessions-40031@lorien_0
>>>>
>>>> and into this subdirectory I have a bunch of empty dirs:
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>> 1841
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>> total 68
>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>>> ...
>>>>
>>>> If I do:
>>>>
>>>> lsof | grep "openmpi-sessions-40031"
>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>> Output information may be incomplete.
>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>> Output information may be incomplete.
>>>>
>>>> nothing...
>>>>
>>>> What else may I check?
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>
>>>> Hi, Eric
>>>>
>>>> I **think** this might be related to the following:
>>>> https://github.com/pmix/master/pull/145
>>>>
>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>
>>>> Best,
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>> Eric,
>>>>
>>>> can you please provide more information on how your tests are launched?
>>>>
>>>> Do you
>>>> mpirun -np 1 ./a.out
>>>> or do you simply
>>>> ./a.out
>>>>
>>>> Do you use a batch manager? If yes, which one?
>>>> Do you run one test per job, or multiple tests per job?
>>>> How are these tests launched?
>>>>
>>>> Does the test that crashes use MPI_Comm_spawn?
>>>> I am surprised by the process name [[9325,5754],0], which suggests that MPI_Comm_spawn was called 5753 times (!)
>>>>
>>>> Can you also run
>>>> hostname
>>>> on the 'lorien' host?
>>>>
>>>> If you configured Open MPI with --enable-debug, can you
>>>> export OMPI_MCA_plm_base_verbose=5
>>>> then run one test and post the logs?
>>>>
>>>> From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325).
>>>> The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.
>>>>
>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>> So strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.
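>>>>
>>>> As a rough illustration only (made-up hash and folding, not the actual orte_plm_base_set_hnp_name() code), squeezing a 32-bit hostname hash and a pid into 16 bits leaves only 65536 possible job families, so two different launches can collide:
>>>>
>>>> #include <stdint.h>
>>>> #include <stdio.h>
>>>>
>>>> /* made-up string hash, standing in for the real hostname hash */
>>>> static uint32_t hash_sketch(const char *s)
>>>> {
>>>>     uint32_t h = 5381u;
>>>>     while (*s != '\0') {
>>>>         h = (h * 33u) ^ (uint32_t)(unsigned char)*s++;
>>>>     }
>>>>     return h;
>>>> }
>>>>
>>>> /* fold hostname hash and pid down to 16 bits */
>>>> static uint16_t jobfam_sketch(const char *nodename, uint32_t pid)
>>>> {
>>>>     return (uint16_t)((hash_sketch(nodename) ^ pid) & 0xffffu);
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     /* two pids on the same node that differ only above bit 15
>>>>        get the same 16-bit job family with this folding */
>>>>     printf("%u\n", (unsigned)jobfam_sketch("lorien", 190552u));
>>>>     printf("%u\n", (unsigned)jobfam_sketch("lorien", 190552u + 0x10000u));
>>>>     return 0;
>>>> }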
>>>>
>>>> The easiest way to detect this could be to
>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>> and replace
>>>>
>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>>
>>>> with
>>>>
>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>>
>>>> configure Open MPI with --enable-debug and rebuild,
>>>> and then
>>>>
>>>>     export OMPI_MCA_plm_base_verbose=4
>>>>
>>>> and run your tests.
>>>>
>>>> When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
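>>>>
>>>> (A rough sketch of how such verbosity-gated output behaves, with hypothetical macro and variable names rather than the actual OPAL_OUTPUT_VERBOSE implementation: the formatting only runs when the component's verbose level reaches the level given at the call site, which is why a level-5 message stays silent under verbose=4, and why raising verbosity executes extra code, sprintf included, that is skipped in normal runs.)
>>>>
>>>> #include <stdio.h>
>>>>
>>>> static int plm_verbose_level = 0;   /* e.g. set from OMPI_MCA_plm_base_verbose */
>>>>
>>>> #define VERBOSE_SKETCH(level, ...)                    \
>>>>     do {                                              \
>>>>         if (plm_verbose_level >= (level)) {           \
>>>>             fprintf(stderr, __VA_ARGS__);             \
>>>>             fputc('\n', stderr);                      \
>>>>         }                                             \
>>>>     } while (0)
>>>>
>>>> int main(void)
>>>> {
>>>>     plm_verbose_level = 4;
>>>>     VERBOSE_SKETCH(4, "plm:base:set_hnp_name: final jobfam %lu",
>>>>                    (unsigned long)53310);             /* printed at verbose=4 */
>>>>     VERBOSE_SKETCH(5, "only visible at verbose >= 5"); /* skipped */
>>>>     return 0;
>>>> }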
>>>>
>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>
>>>> Hi,
>>>>
>>>> It is the third time this has happened in the last 10 days.
>>>>
>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>
>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>
>>>> But I can't reproduce the problem right now... i.e.: if I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>
>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)
>>>>
>>>> Here are the build logs:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>
>>>> Thanks,
>>>>
>>>> Eric

--
JM
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel