What's the story about calling MPI_Finalize without first calling
MPI_Buffer_detach?
If I do an MPI_Bsend followed directly by MPI_Finalize, the
corresponding MPI_Recv takes a very long time (about 4 seconds instead
of 0.06 in the run below). In contrast, if I insert an
MPI_Buffer_detach before MPI_Finalize, the receive time is normal. I
can imagine the explanation: I suspect that MPI_Bsend leaves the
message in a local buffer, so the sender has to make progress before
the receive can complete, and MPI_Buffer_detach must drive that
progress more aggressively than MPI_Finalize does.
1) Any guidance from MPI gurus on what proper practice is?
2) Any guidance from OMPI developers on what sort of fix makes sense?
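
For reference, a minimal sketch of the detach workaround described above. It is only an illustration, not a statement of what proper practice is. The program name, the message length n, the dummy variable, and the use of MPI_Pack_size plus MPI_BSEND_OVERHEAD to size the buffer are assumptions of this sketch and are not taken from the attached test. It relies on MPI_Buffer_detach blocking until the messages in the buffer have been transmitted.

```fortran
program bsend_detach_sketch
  implicit none
  include "mpif.h"
  integer, parameter :: n = 1000     ! message length in real(8) elements (arbitrary)
  real(8) :: x(n)
  real(8) :: dummy                   ! receives the detached buffer address; unused here
  character, allocatable :: buf(:)   ! storage handed to MPI for buffered sends
  integer :: ier, me, packsize, bufbytes, detbytes

  call MPI_Init(ier)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
  x = 0.d0

  if (me == 1) then
     ! Size the buffer for one message plus the per-message bookkeeping overhead.
     call MPI_Pack_size(n, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, packsize, ier)
     bufbytes = packsize + MPI_BSEND_OVERHEAD
     allocate(buf(bufbytes))
     call MPI_Buffer_attach(buf, bufbytes, ier)

     ! Returns once the message is copied into buf; delivery may still be pending.
     call MPI_Bsend(x, n, MPI_DOUBLE_PRECISION, 0, 343, MPI_COMM_WORLD, ier)

     ! Blocks until every message in the buffer has been transmitted.
     call MPI_Buffer_detach(dummy, detbytes, ier)
     deallocate(buf)
  else if (me == 0) then
     call MPI_Recv(x, n, MPI_DOUBLE_PRECISION, 1, 343, MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE, ier)
  end if

  call MPI_Finalize(ier)
end program bsend_detach_sketch
```

This needs at least two ranks and can be built and run the same way as the test case (mpif90, then mpirun -np 2). Because the buffer is detached before MPI_Finalize, any wait for the buffered message to drain happens inside MPI_Buffer_detach on rank 1.
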
I attach a test case. On some platforms, the final delay can be on
the order of a minute.
% mpif90 main.F90
% mpirun -np 2 -mca btl sm,self a.out
1 0.021321
0 0.066568
1 0.020978
0 0.061625
1 0.021969
0 0.062380
1 0.020938
0 0.064401
1 0.020759
0 4.098010 # yipes! last receive takes a long time!
% mpif90 -DDETACH main.F90
% mpirun -np 2 -mca btl sm,self a.out
1 0.020913
0 0.064076
1 0.020746
0 0.061015
1 0.020454
0 0.061780
1 0.020457
0 0.060776
1 0.020619
0 0.062484
program main
  implicit none
  include "mpif.h"
  integer, parameter :: nbufbytes = 16000000, nsendbytes = 15892480
  real(8) buf(nbufbytes/8), x(nsendbytes/8), t
  real(8) buf2
  integer mbufbytes
  integer ier, np, me, i
  call MPI_Init(ier)
  call MPI_Comm_size(MPI_COMM_WORLD,np,ier)
  call MPI_Comm_rank(MPI_COMM_WORLD,me,ier)
  buf = 0.d0
  x = 0.d0
  ! rank 1 is the sender; only it needs a buffer for MPI_Bsend
  if ( me == 1 ) call MPI_Buffer_attach(buf, nbufbytes, ier)
  ! time five send/receive pairs, each starting from a barrier
  do i = 1, 5
    call MPI_Barrier(MPI_COMM_WORLD,ier)
    t = MPI_Wtime()
    if ( me == 0 ) call MPI_Recv(x,nsendbytes,MPI_BYTE,1,343, &
                                 MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
    if ( me == 1 ) call MPI_Bsend(x,nsendbytes,MPI_BYTE,0,343, &
                                  MPI_COMM_WORLD,ier)
    t = MPI_Wtime() - t
    ! print rank and elapsed seconds
    write(6,'(i4,f12.6)') me, t
  end do
  ! with -DDETACH, detach the send buffer explicitly before finalizing
#ifdef DETACH
  if ( me == 1 ) call MPI_Buffer_detach(buf2, mbufbytes, ier)
#endif
  call MPI_Finalize(ier)
end program main