No, the child processes are only calling MPIX_Query_cuda_support which
is just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with
"ls" (see above).
I don't have the line numbers, but from the call stack, the only way it could
segfault is if "&proct->stdinev->daemon" is wrong in
orte_iof_hnp_read_local_handler (orte/mca/iof/hnp/iof_hnp_read.c:145).
That means the cbdata passed from libevent to
orte_iof_hnp_read_local_handler() is wrong, or was destroyed, freed, etc.
The crash seems to happen after some ranks have already finished (but others
haven't started yet).
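To illustrate what I mean, here is a rough sketch of the dereference pattern --
paraphrased from the symbols in the backtrace, not the actual iof_hnp_read.c
code; the cast and field names are my assumptions:

/* in-tree ORTE headers, paths approximate */
#include "orte/util/name_fns.h"
#include "orte/mca/iof/base/base.h"

static void read_local_handler_sketch(int fd, short event, void *cbdata)
{
    /* libevent hands back whatever pointer was registered when stdin
     * forwarding was set up; if that proc tracker has already been
     * released (e.g. because those ranks terminated), this is stale. */
    orte_iof_proc_t *proct = (orte_iof_proc_t*)cbdata;

    /* would crash around iof_hnp_read.c:145 if proct/stdinev is gone */
    if (OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
                                                    &proct->stdinev->daemon,
                                                    ORTE_PROC_MY_NAME)) {
        /* ... forward the stdin data to the local proc ... */
    }
}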
Finally, I found how to reproduce it easily. You need to have orted do three
things at the same time: process stdout (child processes writing to stdout),
stdin (I'm hitting Enter to feed stdin to mpirun), and TCP connections (mpirun
across multiple nodes). If I run within a single node, no crash; if I don't hit
Enter, no crash; if I run "sleep 1" instead of "ls /", no crash.
So I run this loop:
while mpirun -host <two nodes at least> -np 6 ls /; do true; done
<works fine until you hit enter and you get the crash>
I'm not sure why MTT reproduces the error... does it write to mpirun's stdin?
On 02/26/2016 11:46 AM, Ralph Castain wrote:
So the child processes are not calling orte_init or anything like that? I can
check it - any chance you can give me a line number via a debug build?
On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
I got this strange crash on master last night while running nv/mpix_test:
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649]
[ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***
This test is not even calling MPI_Init/MPI_Finalize, only MPIX_Query_cuda_support.
So it is really an ORTE race condition, and the problem is hard to reproduce:
it sometimes takes more than 50 runs, with random sleeps between runs, to see it.
I don't even know if we want to fix that -- what do you think?
Sylvain