Should now be fixed in master
> On Feb 26, 2016, at 12:45 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
>
> No, the child processes are only calling MPIX_Query_cuda_support, which is
> just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with "ls" (see
> above).
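>
> For reference, per the above the extension reduces to returning a
> compile-time constant -- a sketch, not the actual source:
>
>     /* OPAL_CUDA_SUPPORT is 0 or 1, decided at configure time */
>     int MPIX_Query_cuda_support(void)
>     {
>         return OPAL_CUDA_SUPPORT;
>     }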
>
> I don't have the line numbers, but from the call stack, the only way it
> could segfault is that "&proct->stdinev->daemon" is wrong in
> orte_iof_hnp_read_local_handler (orte/mca/iof/hnp/iof_hnp_read.c:145).
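>
> That would be consistent with the failing address 0x50: if proct->stdinev is
> NULL (or points into recycled memory), "&proct->stdinev->daemon" is computed
> as a small offset rather than a valid address, and the first read through it
> in orte_util_compare_name_fields() faults. A standalone sketch of that
> arithmetic (struct layout and names are hypothetical, just for illustration):
>
>     #include <stdio.h>
>     #include <stddef.h>
>
>     /* Hypothetical stand-ins for the ORTE structures involved; only the
>      * shape of the expression &proct->stdinev->daemon matters here. */
>     typedef struct { unsigned int jobid; unsigned int vpid; } name_t;
>     typedef struct {
>         char other_members[0x50];   /* assume 0x50 bytes precede 'daemon' */
>         name_t daemon;
>     } read_event_t;
>     typedef struct { read_event_t *stdinev; } proc_t;
>
>     int main(void)
>     {
>         /* With stdinev == NULL, &proct->stdinev->daemon evaluates (in
>          * practice) to NULL + this offset, e.g. 0x50 -- no dereference yet.
>          * The segfault only appears when the resulting pointer is read,
>          * e.g. inside orte_util_compare_name_fields(). */
>         printf("offsetof(read_event_t, daemon) = 0x%zx\n",
>                offsetof(read_event_t, daemon));
>         return 0;
>     }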
>
> Which means that the cbdata passed from libevent to
> orte_iof_hnp_read_local_handler() is either wrong or points to memory that
> was destroyed or freed. The crash seems to happen after some ranks have
> already finished (but others haven't started yet).
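>
> In other words, it looks like a classic libevent use-after-free: the read
> event keeps a raw cbdata pointer, and if the object behind it is released
> (e.g. when a rank finishes) before the event is deleted, the callback still
> fires with a stale pointer. A stripped-down sketch of that hazard against
> plain libevent (names are made up, this is not the actual ORTE code):
>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <event2/event.h>
>
>     /* Hypothetical per-process bookkeeping, standing in for the real
>      * orte_iof_proc_t; only the lifetime matters for this sketch. */
>     struct proc_state { char name[64]; };
>
>     /* Same role as orte_iof_hnp_read_local_handler(): libevent hands back
>      * whatever cbdata pointer was registered, with no validity check. */
>     static void read_handler(evutil_socket_t fd, short what, void *cbdata)
>     {
>         struct proc_state *proct = cbdata;   /* dangling if already freed */
>         (void)fd; (void)what; (void)proct;
>     }
>
>     int main(void)
>     {
>         struct event_base *base = event_base_new();
>         struct proc_state *proct = calloc(1, sizeof(*proct));
>         strcpy(proct->name, "rank-0");
>
>         /* stdin (fd 0) read event carrying proct as its cbdata */
>         struct event *ev = event_new(base, 0, EV_READ | EV_PERSIST,
>                                      read_handler, proct);
>         event_add(ev, NULL);
>
>         /* The hazard: releasing the proc object without event_del()/
>          * event_free() first leaves the registered cbdata dangling... */
>         free(proct);
>         /* ...so a later loop iteration with pending stdin would invoke
>          * read_handler() with the stale pointer:
>          * event_base_loop(base, EVLOOP_NONBLOCK); */
>
>         event_free(ev);
>         event_base_free(base);
>         return 0;
>     }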
>
> Finally, I found how to reproduce it easily. You need to have orted do three
> things at the same time: process stdout (child processes writing to stdout),
> stdin (I'm hitting Enter to produce stdin for mpirun), and TCP connections
> (mpirun across multiple nodes). If I run within a single node, no crash; if I
> don't hit Enter, no crash; if I run "sleep 1" instead of "ls /", no crash.
>
> So I run this loop:
> while mpirun -host <two nodes at least> -np 6 ls /; do true; done
> <works fine until you hit enter and you get the crash>
>
> I'm not sure why MTT is reproducing the error... does it write to mpirun's
> stdin?
>
> On 02/26/2016 11:46 AM, Ralph Castain wrote:
>> So the child processes are not calling orte_init or anything like that? I
>> can check it - any chance you can give me a line number via a debug build?
>>
>>> On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
>>>
>>> I got this strange crash on master last night running nv/mpix_test:
>>>
>>> Signal: Segmentation fault (11)
>>> Signal code: Address not mapped (1)
>>> Failing at address: 0x50
>>> [ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
>>> [ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
>>> [ 2] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
>>> [ 3] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
>>> [ 4] mpirun[0x405649]
>>> [ 5] mpirun[0x403a48]
>>> [ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
>>> [ 7] mpirun[0x4038e9]
>>> *** End of error message ***
>>>
>>> This test is not even calling MPI_Init/Finalize, only
>>> MPIX_Query_cuda_support. So it is really an ORTE race condition, and the
>>> problem is hard to reproduce: it sometimes takes more than 50 runs, with
>>> random sleeps between runs, to see it.
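>>>
>>> For context, the test amounts to something like the sketch below -- no
>>> MPI_Init/MPI_Finalize, just the extension call (the real source may differ
>>> slightly):
>>>
>>>     #include <stdio.h>
>>>     #include <mpi.h>
>>>     #if defined(OPEN_MPI) && OPEN_MPI
>>>     #include <mpi-ext.h>   /* Open MPI extensions: MPIX_Query_cuda_support() */
>>>     #endif
>>>
>>>     int main(void)
>>>     {
>>>         /* No MPI_Init/MPI_Finalize on purpose: the extension only reports
>>>          * a compile-time flag, so the MPI layer is never touched -- the
>>>          * crash is entirely on the mpirun/ORTE side. */
>>>         printf("MPIX_Query_cuda_support() = %d\n", MPIX_Query_cuda_support());
>>>         return 0;
>>>     }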
>>>
>>> I don't even know if we want to fix that -- what do you think?
>>>
>>> Sylvain
>>>