Ralph Castain wrote:
In reviewing last night's MTT tests for the 1.3 branch, I am seeing
several segfault failures in the shared memory BTL when using large
messages. This occurred on both IU's sif machine and on Sun's tests.
Here is a typical stack from MTT:
MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
[burl-ct-v20z-13:14699] *** Process received signal ***
[burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
[burl-ct-v20z-13:14699] Signal code: (128)
[burl-ct-v20z-13:14699] Failing at address: (nil)
[burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
[burl-ct-v20z-13:14699] [ 1] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
[burl-ct-v20z-13:14699] [ 2] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
[burl-ct-v20z-13:14699] [ 3] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
[burl-ct-v20z-13:14699] [ 4] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
[burl-ct-v20z-13:14699] [ 5] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
[burl-ct-v20z-13:14699] [ 6] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
[burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
[burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
[burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
[burl-ct-v20z-13:14699] *** End of error message ***
[burl-ct-v20z-12][[13280,1],0][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(13) failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 14699 on node burl-ct-v20z-13 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Seems like this is something we need to address before release, yes?

I don't know whether this needs to be addressed before release, but my
impression is that we've been living with these errors for a long time.
They're intermittent (roughly a 1% incidence rate, I think), and the
stacks come through coll_tuned or coll_hierarch or the like and end up
in the sm BTL. We discussed them not too long ago on this list. They
predate 1.3.2. I think Terry said they seem hard to reproduce outside
of MTT. (Terry is out this week.)
Anyhow, my impression is that these are not new with this release.
It would be nice to get them off the books in any case. We need to
figure out how to improve reproducibility and then dive into the
coll/sm code.
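
On the reproducibility point, here is a rough sketch of a repeat-until-it-fails harness for use outside MTT. It is only an illustration: the helper names runs_needed and count_failures are made up for this example, the 1% figure is just the guess above, and the mpirun command line in the comment (process count, BTL selection, test path) is an assumption that would need adjusting to match the actual MTT configuration.

```python
import math
import subprocess
import sys


def runs_needed(failure_rate, confidence=0.95):
    """Independent runs needed to see at least one failure with the
    given confidence, if each run fails with probability failure_rate."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited abnormally."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        # Any nonzero status counts as a failure; a negative value means
        # the child itself was killed by a signal.
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical usage (command line is an assumption, not the MTT setup):
#   cmd = ["mpirun", "-np", "4", "--mca", "btl", "self,sm",
#          "src/MPI_Sendrecv_rtoa_c"]
#   n = runs_needed(0.01)
#   print(count_failures(cmd, n), "failures in", n, "runs")
```

If the 1% guess is about right, runs_needed(0.01) works out to 299 runs for a 95% chance of hitting the segfault at least once, which may be part of why a handful of manual runs rarely shows it.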