I’ll try to replicate this and add some diagnostics targeting this exchange.
What is happening is that the client process is attempting to connect to the
ORTE daemon, and for some reason the daemon is not responding to the
connection.

I’ll also add a timeout function in there so we don’t hang when this happens, 
but instead cleanly error out.
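
For reference, here is a rough sketch of the kind of poll()-bounded receive I
have in mind for the client side. The names recv_blocking_timeout and
ACK_TIMEOUT_MS are placeholders, not the actual PMIx code, so the real fix may
look a bit different:

    #include <errno.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define ACK_TIMEOUT_MS 5000   /* placeholder bound on waiting for the ack */

    /* Read exactly `size` bytes from `sd`, failing if no data arrives within
     * ACK_TIMEOUT_MS. Returns 0 on success, -1 on timeout or error. */
    static int recv_blocking_timeout(int sd, char *data, size_t size)
    {
        size_t remaining = size;

        while (remaining > 0) {
            struct pollfd pfd = { .fd = sd, .events = POLLIN };
            int rc = poll(&pfd, 1, ACK_TIMEOUT_MS);

            if (0 == rc) {
                /* daemon never answered - error out instead of hanging */
                fprintf(stderr, "timed out waiting for connect ack\n");
                return -1;
            }
            if (rc < 0) {
                if (EINTR == errno) {
                    continue;              /* interrupted - retry */
                }
                return -1;
            }

            ssize_t n = recv(sd, data, remaining, 0);
            if (n < 0) {
                if (EINTR == errno || EAGAIN == errno) {
                    continue;              /* transient - re-poll and retry */
                }
                return -1;                 /* hard socket error */
            }
            if (0 == n) {
                return -1;                 /* peer closed the connection */
            }
            data += n;
            remaining -= n;
        }
        return 0;
    }

Applied in the connect-ack path, something like this would turn the hang shown
in the trace below into a clean error return.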


> On Sep 3, 2015, at 11:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hi Folks,
> 
> I'm again seeing a case of a hang (yes, I'm going to start using timeout) in a
> two-process run of master on the IU Jenkins server, specifically in the
> --disable-dlopen Jenkins project.
> 
> I attached to the hanging processes and got this backtrace:
> 
> #0  0x00007fdd4ca7ae94 in recv () from /lib64/libpthread.so.0
> #1  0x00007fdd4bab622a in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7fff9342fb78 "&", size=4) at src/usock/usock.c:157
> #2  0x00007fdd4babad69 in recv_connect_ack (sd=13) at src/client/pmix_client.c:777
> #3  0x00007fdd4babbc59 in usock_connect (addr=0x7fff9342fe80) at src/client/pmix_client.c:1026
> #4  0x00007fdd4bab88ae in connect_to_server (address=0x7fff9342fe80, cbdata=0x7fff9342fc30) at src/client/pmix_client.c:177
> #5  0x00007fdd4bab90f7 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7fdd4c2e9820 <myproc>) at src/client/pmix_client.c:329
> #6  0x00007fdd4bff1892 in pmix1_client_init () at pmix1_client.c:58
> #7  0x00007fdd4c37ce1d in pmi_component_query (module=0x7fff9342ffd0, priority=0x7fff9342ffcc) at ess_pmi_component.c:89
> #8  0x00007fdd4bf54c38 in mca_base_select (type_name=0x7fdd4c45e5b9 "ess", output_id=-1, components_available=0x7fdd4c6b21d0 <orte_ess_base_framework+80>, best_module=0x7fff93430000, best_component=0x7fff93430008) at mca_base_components_select.c:73
> #9  0x00007fdd4c373f0d in orte_ess_base_select () at base/ess_base_select.c:39
> #10 0x00007fdd4c312fed in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:221
> #11 0x00007fdd4d788e26 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fff934300fc) at runtime/ompi_mpi_init.c:468
> #12 0x00007fdd4d7be27a in PMPI_Init (argc=0x7fff93430138, argv=0x7fff93430130) at pinit.c:84
> #13 0x00007fdd4dce515e in ompi_init_f (ierr=0x7fff9343043c) at pinit_f.c:82
> #14 0x0000000000400dff in MAIN__ ()
> #15 0x0000000000400f38 in main ()
> 
> This seems to happen only intermittently.
> 
> Any suggestions on how to analyze this further?
> 
> Howard
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17943.php