Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

Eric Chamberland Fri, 16 Sep 2016 06:08:35 -0700

Hi,

I know the pull request has not (yet) been merged, but here is asomewhat "different" output from a single sequential test(automatically) laucnhed without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rshpath NULL[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash1366255883

[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered

[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not adynamic spawn[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: receivedunexpected process identifier [[41545,0],0] from [[39075,0],0]

[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00

[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stopcomm



unfortunately, I didn't got any coredump (???)  The line:

[lorien:172218] Signal code: Invalid permissions (2)

is curious or not?

as usual, here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Does the PR #1376 will prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch



the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,

it might be easier for you to update it and mpirun -np 1 ./a.out instead
of ./a.out


basically, increasing verbosity runs some extra code, which include
sprintf.
so yes, it is possible to crash an app by increasing verbosity by
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out

Damn! I did got one but it got erased last night when the automatic
process started again... (which erase all directories before starting) :/

I would like to put core files in a user specific directory, but it
seems it has to be a system-wide configuration... :/  I will trick this
by changing the "pwd" to a path outside the erased directory...

So as of tonight I should be able to retrieve core files even after I
relaunched the process..

Thanks for all the support!

Eric


Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because
there has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt



[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x0000000001e58770
***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!

stdout:

--------------------------------------------------------------------------


It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
instead of ORTE_SUCCESS
--------------------------------------------------------------------------


--------------------------------------------------------------------------


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127)
instead of "Success" (0)
--------------------------------------------------------------------------



openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib     250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib   45056 Sep 14 13:34
openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0>
find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552]
plm:base:set_hnp_name: final jobfam 53310

(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt



config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log



ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt



Maybe it aborted (instead of giving the other message) while doing the
error because of export OMPI_MCA_plm_base_verbose=5 ?

Thanks,

Eric


On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

do you mean you have a unique $TMP per a.out ?
or a unique $TMP per "batch" of run ?

in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please please post the output of the
failed a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland
<[email protected]
<mailto:[email protected]>> wrote:

    Lucky!

    Since each runs have a specific TMP, I still have it on disc.

    for the faulty run, the TMP variable was:

    TMP=/tmp/tmp.wOv5dkNaSI

    and into $TMP I have:

    openmpi-sessions-40031@lorien_0

    and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
    ls -la |wc -l
    1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
    ls -la |more
    total 68
    drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
    drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
    drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
    drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
    drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
    drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
    ...

    If I do:

    lsof |grep "openmpi-sessions-40031"
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
    /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() tracefs file system
    /sys/kernel/debug/tracing
          Output information may be incomplete.

    nothing...

    What else may I check?

    Eric


    On 14/09/16 08:47 AM, Joshua Ladd wrote:

        Hi, Eric

        I **think** this might be related to the following:

        https://github.com/pmix/master/pull/145
        <https://github.com/pmix/master/pull/145>

        I'm wondering if you can look into the /tmp directory and see
if you
        have a bunch of stale usock files.

        Best,

        Josh


        On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
        <[email protected]
        <mailto:[email protected]>> wrote:

            Eric,


            can you please provide more information on how your tests
        are launched ?

            do you

            mpirun -np 1 ./a.out

            or do you simply

            ./a.out


            do you use a batch manager ? if yes, which one ?

            do you run one test per job ? or multiple tests per job ?

            how are these tests launched ?


            do the test that crashes use MPI_Comm_spawn ?

            i am surprised by the process name [[9325,5754],0], which
        suggests
            there MPI_Comm_spawn was called 5753 times (!)


            can you also run

            hostname

            on the 'lorien' host ?

            if you configure'd Open MPI with --enable-debug, can you

            export OMPI_MCA_plm_base_verbose 5

            then run one test and post the logs ?


            from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
        should
            produce job family 5576 (but you get 9325)

            the discrepancy could be explained by the use of a batch
manager
            and/or a full hostname i am unaware of.


            orte_plm_base_set_hnp_name() generate a 16 bits job family
        from the
            (32 bits hash of the) hostname and the mpirun (32 bits ?)
pid.

            so strictly speaking, it is possible two jobs launched on
        the same
            node are assigned the same 16 bits job family.


            the easiest way to detect this could be to

            - edit orte/mca/plm/base/plm_base_jobid.c

            and replace

                OPAL_OUTPUT_VERBOSE((5,
        orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final
        jobfam %lu",
                                     (unsigned long)jobfam));

            with

                OPAL_OUTPUT_VERBOSE((4,
        orte_plm_base_framework.framework_output,
                                     "plm:base:set_hnp_name: final
        jobfam %lu",
                                     (unsigned long)jobfam));

            configure Open MPI with --enable-debug and rebuild

            and then

            export OMPI_MCA_plm_base_verbose=4

            and run your tests.


            when the problem occurs, you will be able to check which
pids
            produced the faulty jobfam, and that could hint to a
conflict.


            Cheers,


            Gilles



            On 9/14/2016 12:35 AM, Eric Chamberland wrote:

                Hi,

                It is the third time this happened into the last 10
days.

                While running nighlty tests (~2200), we have one or two
        tests
                that fails at the very beginning with this strange
error:

                [lorien:142766] [[9325,5754],0]
usock_peer_recv_connect_ack:
                received unexpected process identifier [[9325,0],0]
from
                [[5590,0],0]

                But I can't reproduce the problem right now... ie: If I
        launch
                this test alone "by hand", it is successful... the same
        test was
                successful yesterday...

                Is there some kind of "race condition" that can happen
        on the
                creation of "tmp" files if many tests runs together on
        the same
                node? (we are oversubcribing even sequential runs...)

                Here are the build logs:


http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log


<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log>



<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log


<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log>>




http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt>



<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


<http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt>>




                Thanks,

                Eric
                _______________________________________________
                devel mailing list
                [email protected]
<mailto:[email protected]>

https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>

<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>>


            _______________________________________________
            devel mailing list
            [email protected] <mailto:[email protected]>
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>>


    _______________________________________________
    devel mailing list
    [email protected]
    https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
<https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>

_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

Reply via email to