No, the child processes are only calling MPIX_Query_cuda_support, which is just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with "ls" (see above).
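
For reference, the test is essentially just this (a sketch from memory, not the exact nv/mpix_test source):

  /* No MPI_Init/MPI_Finalize at all; only the CUDA-support query,
     which in this build amounts to "return OPAL_CUDA_SUPPORT". */
  #include <stdio.h>
  #include <mpi.h>
  #include <mpi-ext.h>    /* declares MPIX_Query_cuda_support() */

  int main(void)
  {
      printf("CUDA support: %d\n", MPIX_Query_cuda_support());
      return 0;
  }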

I don't have the line numbers, but from the calling stack, the only way it could segfault is that "&proct->stdinev->daemon" is wrong in orte_iof_hnp_read_local_handler (orte/mca/iof/hnp/iof_hnp_read.c:145).

That means the cbdata passed from libevent to orte_iof_hnp_read_local_handler() is wrong, or the object it points to was already destroyed/freed. The crash seems to happen after some ranks have already finished (but others haven't started yet).
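
That would also be consistent with the failing address: a member read through a NULL (or recycled) intermediate pointer faults at a small offset such as 0x50. A self-contained illustration of the pattern (the struct names and the 0x50 offset are made up for the example, not taken from the ORTE sources):

  #include <stdio.h>

  /* Hypothetical layout, just to show where an address like 0x50 can come from. */
  struct daemon_name { int jobid; int vpid; };
  struct stdin_ev    { char pad[0x50]; struct daemon_name daemon; };
  struct proc        { struct stdin_ev *stdinev; };

  int main(void)
  {
      struct proc p = { .stdinev = NULL };   /* stale/NULL, as suspected above */
      /* Computing &p.stdinev->daemon yields (char*)0 + 0x50; the crash then
         comes when orte_util_compare_name_fields() reads through that pointer,
         i.e. "Address not mapped" at 0x50. */
      printf("would-be faulting address: %p\n", (void *)&p.stdinev->daemon);
      return 0;
  }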

Finally, I found how to reproduce it easily. You need orted to be doing three things at the same time: processing stdout (child processes writing to stdout), processing stdin (I'm hitting enter to produce stdin for mpirun), and handling TCP connections (mpirun across multiple nodes). If I run within a single node, I get no crash; if I don't hit "enter", no crash; if I run "sleep 1" instead of "ls /", no crash.

So I run this loop:
  while mpirun -host <two nodes at least> -np 6 ls /; do true; done
  <works fine until you hit enter and you get the crash>

I'm not sure why MTT reproduces the error ... does it write to mpirun's stdin?
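
If it does, something like this (an untested guess on my side) should trigger it without having to type anything:

  # feed a newline to mpirun's stdin instead of hitting enter by hand
  while echo | mpirun -host <two nodes at least> -np 6 ls /; do true; done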

On 02/26/2016 11:46 AM, Ralph Castain wrote:
So the child processes are not calling orte_init or anything like that? I can 
check it - any chance you can give me a line number via a debug build?

On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

I got this strange crash on master last night running nv/mpix_test:

Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649]
[ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***

This test is not even calling MPI_Init/Finalize, only MPIX_Query_cuda_support. So it is really an ORTE race condition, and the problem is hard to reproduce. It sometimes takes more than 50 runs, with random sleeps between runs, to see the problem.

I don't even know if we want to fix that -- what do you think?

Sylvain



