I got this strange crash on master this night running nv/mpix_test :
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1]
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2]
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3]
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649][drossetti-ivy4:31651] [ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***
This test is not even calling MPI_Init/Finalize, only
MPIX_Query_cuda_support. So it is really an ORTE race condition, and the
problem is hard to reproduce. It takes sometimes more than 50 runs with
random sleep between runs to see the problem.
I don't even know if we want to fix that -- what do you think ?
Sylvain
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may
contain
confidential information. Any unauthorized review, use, disclosure or
distribution
is prohibited. If you are not the intended recipient, please contact the
sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------