No, the child processes are only calling MPIX_Query_cuda_support, which is just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with "ls" (see above).
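
For reference, the test is essentially just this (a sketch from memory, not the exact nv/mpix_test source):

  /* No MPI_Init/MPI_Finalize at all; only the CUDA-support query,
     which in this build amounts to "return OPAL_CUDA_SUPPORT". */
  #include <stdio.h>
  #include <mpi.h>
  #include <mpi-ext.h>    /* declares MPIX_Query_cuda_support() */

  int main(void)
  {
      printf("CUDA support: %d\n", MPIX_Query_cuda_support());
      return 0;
  }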

I don't have the line numbers, but from the calling stack, the only way it could segfault is that "&proct->stdinev->daemon" is wrong in orte_iof_hnp_read_local_handler (orte/mca/iof/hnp/iof_hnp_read.c:145).

That means the cbdata passed from libevent to orte_iof_hnp_read_local_handler() is wrong, or the object it points to was already destroyed/freed. The crash seems to happen after some ranks have already finished (but others haven't started yet).
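
That would also be consistent with the failing address: a member read through a NULL (or recycled) intermediate pointer faults at a small offset such as 0x50. A self-contained illustration of the pattern (the struct names and the 0x50 offset are made up for the example, not taken from the ORTE sources):

  #include <stdio.h>

  /* Hypothetical layout, just to show where an address like 0x50 can come from. */
  struct daemon_name { int jobid; int vpid; };
  struct stdin_ev    { char pad[0x50]; struct daemon_name daemon; };
  struct proc        { struct stdin_ev *stdinev; };

  int main(void)
  {
      struct proc p = { .stdinev = NULL };   /* stale/NULL, as suspected above */
      /* Computing &p.stdinev->daemon yields (char*)0 + 0x50; the crash then
         comes when orte_util_compare_name_fields() reads through that pointer,
         i.e. "Address not mapped" at 0x50. */
      printf("would-be faulting address: %p\n", (void *)&p.stdinev->daemon);
      return 0;
  }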

Finally, I found how to reproduce it easily. You need orted to be doing three things at the same time: processing stdout (child processes writing to stdout), processing stdin (I'm hitting enter to produce stdin for mpirun), and handling TCP connections (mpirun across multiple nodes). If I run within a single node, I get no crash; if I don't hit "enter", no crash; if I run "sleep 1" instead of "ls /", no crash.

So I run this loop:
  while mpirun -host <two nodes at least> -np 6 ls /; do true; done
  <works fine until you hit enter and you get the crash>

I'm not sure why MTT reproduces the error ... does it write to mpirun's stdin?
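
If it does, something like this (an untested guess on my side) should trigger it without having to type anything:

  # feed a newline to mpirun's stdin instead of hitting enter by hand
  while echo | mpirun -host <two nodes at least> -np 6 ls /; do true; done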

On 02/26/2016 11:46 AM, Ralph Castain wrote:
So the child processes are not calling orte_init or anything like that? I can 
check it - any chance you can give me a line number via a debug build?

On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

I got this strange crash on master last night running nv/mpix_test:

Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649]
[ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***

This test is not even calling MPI_Init/Finalize, only MPIX_Query_cuda_support. So it is really an ORTE race condition, and the problem is hard to reproduce. It sometimes takes more than 50 runs, with random sleeps between runs, to see the problem.

I don't even know if we want to fix that -- what do you think?

Sylvain



