Hi Ralph,

A word of warning: this seems to be hard to reproduce, at least on the UH server.

Howard


2015-09-03 13:12 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> I’ll try to replicate, and provide some diagnostics targeting this
> exchange. What is happening is that the client process is attempting to
> connect to the ORTE daemon, and for some reason the connection isn’t
> generating a response from the daemon.
>
> I’ll also add a timeout function in there so we don’t hang when this
> happens, but instead cleanly error out.
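
For illustration only, here is a minimal sketch of what a bounded receive could look like. It is not the PMIx code and not necessarily what the fix will do. The helper name recv_blocking_timeout and its timeout_ms parameter are hypothetical. It assumes a plain poll() followed by recv() on the connected socket, and the timeout applies to each wait, not to the whole read.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not the PMIx implementation: read exactly `size`
 * bytes from `sd`, waiting at most `timeout_ms` milliseconds each time
 * the socket has no data ready.
 * Returns 0 on success and -1 on failure, with errno set to ETIMEDOUT
 * on a timeout or ECONNRESET if the peer closes before all bytes arrive. */
static int recv_blocking_timeout(int sd, char *data, size_t size, int timeout_ms)
{
    size_t got = 0;

    while (got < size) {
        struct pollfd pfd = { .fd = sd, .events = POLLIN, .revents = 0 };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {                      /* nothing arrived in time */
            errno = ETIMEDOUT;
            return -1;
        }
        if (rc < 0) {
            if (errno == EINTR) continue;   /* interrupted by a signal: retry */
            return -1;
        }

        ssize_t n = recv(sd, data + got, size - got, 0);
        if (n == 0) {                       /* peer closed the socket */
            errno = ECONNRESET;
            return -1;
        }
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN) continue;
            return -1;
        }
        got += (size_t)n;
    }
    return 0;
}
```

With something along these lines, a client waiting for the 4-byte connect ack would get an error back after the timeout and could abort cleanly instead of sitting in recv() forever.
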
>
>
> On Sep 3, 2015, at 11:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Folks,
>
> I'm again seeing a hang (yes, I'm going to start using timeout) in a
> two-process run on the IU Jenkins server for master. This is the
> --disable-dlopen Jenkins project on that server.
>
> I attached to the hanging processes and got this backtrace:
>
> #0  0x00007fdd4ca7ae94 in recv () from /lib64/libpthread.so.0
> #1  0x00007fdd4bab622a in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7fff9342fb78 "&", size=4) at src/usock/usock.c:157
> #2  0x00007fdd4babad69 in recv_connect_ack (sd=13) at src/client/pmix_client.c:777
> #3  0x00007fdd4babbc59 in usock_connect (addr=0x7fff9342fe80) at src/client/pmix_client.c:1026
> #4  0x00007fdd4bab88ae in connect_to_server (address=0x7fff9342fe80, cbdata=0x7fff9342fc30) at src/client/pmix_client.c:177
> #5  0x00007fdd4bab90f7 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7fdd4c2e9820 <myproc>) at src/client/pmix_client.c:329
> #6  0x00007fdd4bff1892 in pmix1_client_init () at pmix1_client.c:58
> #7  0x00007fdd4c37ce1d in pmi_component_query (module=0x7fff9342ffd0, priority=0x7fff9342ffcc) at ess_pmi_component.c:89
> #8  0x00007fdd4bf54c38 in mca_base_select (type_name=0x7fdd4c45e5b9 "ess", output_id=-1, components_available=0x7fdd4c6b21d0 <orte_ess_base_framework+80>, best_module=0x7fff93430000, best_component=0x7fff93430008) at mca_base_components_select.c:73
> #9  0x00007fdd4c373f0d in orte_ess_base_select () at base/ess_base_select.c:39
> #10 0x00007fdd4c312fed in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:221
> #11 0x00007fdd4d788e26 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fff934300fc) at runtime/ompi_mpi_init.c:468
> #12 0x00007fdd4d7be27a in PMPI_Init (argc=0x7fff93430138, argv=0x7fff93430130) at pinit.c:84
> #13 0x00007fdd4dce515e in ompi_init_f (ierr=0x7fff9343043c) at pinit_f.c:82
> #14 0x0000000000400dff in MAIN__ ()
> #15 0x0000000000400f38 in main ()
>
> This seems to happen only intermittently.
>
> Any suggestions on how to analyze this further?
>
> Howard
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17943.php
>
