Fixed - r26406

On May 7, 2012, at 10:35 PM, Eugene Loh wrote:

> Here is another trunk hang.  I get it if I use at least three remote nodes.  
> E.g., with r26385:
> 
> % mpirun -H remoteA,remoteB,remoteC -n 2 hostname
> [remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file base/ess_base_fns.c at line 135
> [remoteA:20508] [[54625,0],1] unable to get hostname for daemon 3
> [remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file orted/orted_comm.c at line 345
> [hang]
> 
> I think the problem first appeared with r26359.
> 
> I guess if a remote orted has to spawn another orted, it gets here:
> 
>  opal_pointer_array_get_item(table = 0x7e410, element_index = 3), line 136 in "opal_pointer_array.h"
>  find_proc(proc = 0xffbff264), line 51 in "ess_base_fns.c"
>  orte_ess_base_proc_get_hostname(proc = 0xffbff264), line 134 in "ess_base_fns.c"
>  remote_spawn(launch = 0x85f30), line 812 in "plm_rsh_module.c"
>  orte_daemon_recv(status = 0, sender = 0x85f54, buffer = 0x85f30, tag = 1U, cbdata = (nil)), line 344 in "orted_comm.c"
>  orte_rml_recv_msg_callback(status = 0, peer = 0x69014, iov = 0x7d7e0, count = 2, tag = 1U, cbdata = 0x85ec0), line 68 in "rml_oob_recv.c"
>  mca_oob_tcp_msg_data(msg = 0x85310, peer = 0x69000), line 436 in "oob_tcp_msg.c"
>  mca_oob_tcp_msg_recv_complete(msg = 0x85310, peer = 0x69000), line 322 in "oob_tcp_msg.c"
>  mca_oob_tcp_peer_recv_handler(sd = 13, flags = 2, user = 0x69000), line 942 in "oob_tcp_peer.c"
>  event_persist_closure(base = 0x3c600, ev = 0x647a8), line 1280 in "event.c"
>  event_process_active_single_queue(base = 0x3c600, activeq = 0x3c4f0), line 1324 in "event.c"
>  event_process_active(base = 0x3c600), line 1396 in "event.c"
>  opal_libevent2013_event_base_loop(base = 0x3c600, flags = 1), line 1593 in "event.c"
>  orte_daemon(argc = 19, argv = 0xffbff97c), line 729 in "orted_main.c"
>  main(argc = 19, argv = 0xffbff97c), line 62 in "orted.c"
> 
> So, in my case, I'm trying to look up item 3 while only item 1 in the array 
> appears to be initialized.
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
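
For anyone reading this in the archive, the failure mode described above (asking a sparse table for item 3 when only item 1 has been filled in) can be sketched in a few lines of C. This is only an illustration under stated assumptions: ptr_array_t, ptr_array_get, daemon_info_t, lookup_hostname and try_remote_spawn are hypothetical names made up for the example, not the Open MPI code, and it does not show what r26406 actually changed.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a sparse pointer table.
 * This is NOT Open MPI's opal_pointer_array_t. */
typedef struct {
    void **slots;   /* a NULL entry means the slot was never initialized */
    size_t size;    /* number of slots */
} ptr_array_t;

/* Return the stored pointer, or NULL if the index is out of range or unset. */
static void *ptr_array_get(const ptr_array_t *table, size_t index)
{
    if (table == NULL || index >= table->size) {
        return NULL;
    }
    return table->slots[index];
}

/* Hypothetical per-daemon record holding just a hostname. */
typedef struct {
    const char *hostname;
} daemon_info_t;

/* Look up a daemon's hostname; NULL tells the caller "not found". */
static const char *lookup_hostname(const ptr_array_t *daemons, size_t vpid)
{
    const daemon_info_t *d = ptr_array_get(daemons, vpid);
    return (d != NULL) ? d->hostname : NULL;
}

/* Mimic the reported situation: only item 1 is populated.
 * Returns 0 if the target daemon is known, -1 otherwise. */
static int try_remote_spawn(size_t target)
{
    daemon_info_t self = { "remoteA" };
    void *slots[4] = { NULL, &self, NULL, NULL };
    ptr_array_t daemons = { slots, 4 };

    const char *host = lookup_hostname(&daemons, target);
    if (host == NULL) {
        /* Report the failure and return instead of waiting on a launch
         * that can never happen. */
        fprintf(stderr, "unable to get hostname for daemon %zu\n", target);
        return -1;
    }
    printf("would launch on %s\n", host);
    return 0;
}
```

In this sketch try_remote_spawn(3) takes the "not found" path because slot 3 was never set, while try_remote_spawn(1) succeeds. The point is only that a lookup into a partially populated table has to be checked by the caller.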
