I got this strange crash on master last night while running nv/mpix_test:

Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649]
[drossetti-ivy4:31651] [ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***

This test does not even call MPI_Init/Finalize, only MPIX_Query_cuda_support, so it really is an ORTE race condition. The problem is hard to reproduce: it sometimes takes more than 50 runs, with random sleeps between runs, to see it.
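
For anyone who wants to try reproducing it, here is a rough sketch of the kind of loop I mean: run the command repeatedly with a random pause in between and stop at the first failure. The helper name, the default limits, and the mpirun command line in the comment are only assumptions for illustration, not the exact MTT invocation.

```python
import random
import subprocess
import time


def run_until_failure(cmd, max_runs=200, max_sleep=1.0):
    """Run cmd repeatedly; return the 1-based index of the first failing run, or None."""
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        # Any nonzero return code counts as a failure. On POSIX, a negative
        # value means the child was terminated by a signal.
        if result.returncode != 0:
            print(f"run {attempt} failed with return code {result.returncode}")
            return attempt
        # Random pause between runs to vary the timing.
        time.sleep(random.uniform(0.0, max_sleep))
    return None


# Hypothetical usage (adjust the path and arguments to your setup):
# run_until_failure(["mpirun", "-np", "1", "./mpix_test"])
```

Since the crash can take more than 50 runs to show up, max_runs probably needs to be well above that to have a fair chance of catching it.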

I don't even know if we want to fix that -- what do you think?

Sylvain


