Should now be fixed in master

> On Feb 26, 2016, at 12:45 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
> 
> No, the child processes are only calling MPIX_Query_cuda_support, which is
> just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with "ls" (see
> the reproducer below).
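> 
> Roughly, the test boils down to something like the sketch below (a
> simplified sketch, not the exact nv/mpix_test source):
> 
>   /* Simplified sketch; not the exact nv/mpix_test source. */
>   #include <stdio.h>
>   #include <mpi.h>
>   #include <mpi-ext.h>   /* Open MPI extensions: declares MPIX_Query_cuda_support() */
> 
>   int main(void)
>   {
>       /* Just returns the compiled-in OPAL_CUDA_SUPPORT value (0 or 1). */
>       printf("CUDA support: %d\n", MPIX_Query_cuda_support());
>       return 0;   /* intentionally no MPI_Init/MPI_Finalize */
>   }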
> 
> I don't have the line numbers, but from the call stack, the only way it
> could segfault is if "&proct->stdinev->daemon" is invalid in
> orte_iof_hnp_read_local_handler() (orte/mca/iof/hnp/iof_hnp_read.c:145).
> 
> That means the cbdata passed from libevent to
> orte_iof_hnp_read_local_handler() is either wrong or has already been
> destroyed/freed. The crash seems to happen after some ranks have already
> finished (but others have not started yet).
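> 
> In generic libevent terms (not the actual ORTE code; the struct and field
> names below are only made up to mirror the report), the hazard I suspect
> looks like this: the read event's cbdata is freed while the event stays
> registered, so the next callback dereferences freed memory.
> 
>   /* Generic libevent sketch of the suspected hazard; NOT the ORTE code. */
>   #include <stdlib.h>
>   #include <event2/event.h>
> 
>   struct proct {
>       struct event *stdinev;   /* read event registered on stdin */
>       int           daemon;    /* stands in for the field that faults */
>   };
> 
>   static void read_cb(evutil_socket_t fd, short what, void *cbdata)
>   {
>       struct proct *proct = cbdata;
>       (void)fd; (void)what;
>       /* If proct was already freed (e.g. its ranks finished) but the event
>        * was never event_del()'d, this is a use-after-free and something
>        * like &proct->stdinev->daemon can point anywhere. */
>       (void)proct->daemon;
>   }
> 
>   struct proct *proct_create(struct event_base *base, int fd)
>   {
>       struct proct *proct = calloc(1, sizeof(*proct));
>       proct->stdinev = event_new(base, fd, EV_READ | EV_PERSIST,
>                                  read_cb, proct);   /* proct is the cbdata */
>       event_add(proct->stdinev, NULL);
>       return proct;
>   }
> 
>   void proct_destroy(struct proct *proct)
>   {
>       /* Safe teardown order: unregister the event before freeing its
>        * cbdata. Freeing proct without these two calls is exactly the
>        * "wrong or destroyed cbdata" scenario described above. */
>       event_del(proct->stdinev);
>       event_free(proct->stdinev);
>       free(proct);
>   }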
> 
> Finally, I found how to reproduce it easily. You need to have orted do three
> things at the same time: process stdout (child processes writing to stdout),
> stdin (I'm hitting enter to feed stdin to mpirun), and TCP connections
> (mpirun across multiple nodes). If I run within a single node, there is no
> crash; if I don't hit enter, no crash; if I run "sleep 1" instead of "ls /",
> no crash.
> 
> So I run this loop:
>   while mpirun -host <two nodes at least> -np 6 ls /; do true; done
> It works fine until you hit enter, at which point you get the crash.
> 
> I'm not sure why MTT reproduces the error... does it write to mpirun's
> stdin?
> 
> On 02/26/2016 11:46 AM, Ralph Castain wrote:
>> So the child processes are not calling orte_init or anything like that? I
>> can check it; any chance you can give me a line number from a debug build?
>> 
>>> On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
>>> 
>>> I got this strange crash on master last night while running nv/mpix_test:
>>> 
>>> Signal: Segmentation fault (11)
>>> Signal code: Address not mapped (1)
>>> Failing at address: 0x50
>>> [ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
>>> [ 1] 
>>> /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
>>> [ 2] 
>>> /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
>>> [ 3] 
>>> /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
>>> [ 4] mpirun[0x405649]
>>> [drossetti-ivy4:31651] [ 5] mpirun[0x403a48]
>>> [ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
>>> [ 7] mpirun[0x4038e9]
>>> *** End of error message ***
>>> 
>>> This test does not even call MPI_Init/MPI_Finalize, only
>>> MPIX_Query_cuda_support, so it is really an ORTE race condition. The
>>> problem is hard to reproduce: it sometimes takes more than 50 runs, with
>>> random sleeps between runs, to see it.
>>> 
>>> I don't even know if we want to fix that; what do you think?
>>> 
>>> Sylvain
>>> 
