I've run into an issue while running hpl where a message has been sent
(in shared memory in this case) and the receiver calls iprobe but
doesn't see the message on the first call to iprobe (even though it is
there), and only sees it on the second call. Looking at the
mca_pml_ob1_iprobe function and the calls it makes, it looks like it
checks the unexpected queue for matches and, if it doesn't find one,
sets the flag to 0 (no matches), then calls opal_progress and returns.
This seems wrong to me, since I would expect that the call to
opal_progress would probably pull in the message that the iprobe is
waiting for.
Am I correct in my reading of the code? It seems that maybe some sort
of check needs to be done after the call to opal_progress in
mca_pml_ob1_iprobe.
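To make my reading concrete, here is a toy model of the control flow I
think I'm seeing. This is not the ob1 source: struct toy_pml,
toy_progress, toy_iprobe_as_read and toy_iprobe_recheck are names I made
up, and the "queue" is just a counter. It only assumes what I described
above, namely that the message does not show up in the unexpected queue
until progress has been driven.

```c
#include <stdbool.h>

/* Toy stand-in, NOT Open MPI code: a message is "in flight" until
 * progress is driven, and only then lands in the unexpected queue. */
struct toy_pml {
    int in_flight;   /* sent, but not yet pulled in by progress */
    int unexpected;  /* sitting in the unexpected queue */
};

/* Plays the role of opal_progress: move pending messages to the queue. */
void toy_progress(struct toy_pml *p)
{
    p->unexpected += p->in_flight;
    p->in_flight = 0;
}

/* The flow as I read it: check the queue, report "no match",
 * then drive progress and return without looking again. */
void toy_iprobe_as_read(struct toy_pml *p, int *flag)
{
    if (p->unexpected > 0) {
        *flag = 1;
        return;
    }
    *flag = 0;
    toy_progress(p);
}

/* One possible change (my assumption, not a tested patch):
 * check the queue again after progress has run. */
void toy_iprobe_recheck(struct toy_pml *p, int *flag)
{
    if (p->unexpected > 0) {
        *flag = 1;
        return;
    }
    toy_progress(p);
    *flag = (p->unexpected > 0);
}
```

Starting from one in-flight message and an empty queue, the first
variant needs two calls before flag becomes 1, which matches what I
observe, while the second variant reports the message on the first call.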
Attached is a simple program that shows the issue I am running into:
#include <stdio.h>   /* printf */
#include <unistd.h>  /* sleep */
#include <mpi.h>

int main(void) {
    int rank, src[2] = {0, 0}, dst[2], flag = 0;
    int nxfers;
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
        /* Sender: post five small messages to rank 1. */
        for (nxfers = 0; nxfers < 5; nxfers++)
            MPI_Send(src, 2, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        for (nxfers = 0; nxfers < 5; nxfers++) {
            /* Wait so the message should already be there when we probe. */
            sleep(5);
            flag = 0;
            /* One "iprobe..." is printed per MPI_Iprobe call. */
            while (!flag) {
                printf("iprobe...");
                MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
            }
            printf("\n");
            MPI_Recv(dst, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Finalize();
    return 0;
}
--td