Hi,

a colleague of mine noticed that the error handling in the ucx pml doesn't work as expected when a message is truncated in an MPI_Recv call. The truncation error is returned in the error field of the MPI status, while the receive call itself returns MPI_SUCCESS. This means that it's not possible to catch errors in receive calls when using MPI_STATUS_IGNORE.
If I understand the MPI standard correctly, this behavior of the ucx pml is not allowed. Only routines that wait for or test the completion of several messages are allowed to modify the error field. Also, the return value should be set to MPI_ERR_IN_STATUS if the error field is used. I understand that programs should not rely on the error status and should use probe if there is a risk that messages are truncated, but the current behavior can make debugging quite frustrating.

For example, running the attached truncation example using the ucx pml doesn't raise any errors, but the result is incorrect:

    $ mpiexec -mca pml ucx -np 2 ./truncate
    Rank 1 received 0

Running with ob1 triggers the default error handling:

    $ mpiexec -mca pml ob1 -np 2 ./truncate
    *** An error occurred in MPI_Recv
    *** reported by process [584581121,1]
    *** on communicator MPI_COMM_WORLD
    *** MPI_ERR_TRUNCATE: message truncated
    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    ***    and potentially your MPI job)

I've tested this using versions 4.0.2a1, 4.0.3 and 4.0.4. After checking pml_ucx.c, I assume that it's an implementation issue in the receive calls rather than a single isolated bug.

Best regards,
Sami Ilvonen
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Abort the job if an MPI call returns an error code. The call is
   evaluated exactly once and its result is kept in a local variable. */
#define MPI_CHECK(call)                                         \
    do {                                                        \
        int mpi_check_rc = (call);                              \
        if (mpi_check_rc != MPI_SUCCESS) {                      \
            fprintf(stderr, "MPI error in %s at line %i\n",     \
                    __FILE__, __LINE__);                        \
            MPI_Abort(MPI_COMM_WORLD, mpi_check_rc);            \
            exit(mpi_check_rc);                                 \
        }                                                       \
    } while (0)

int main(int argc, char *argv[])
{
    int i, myid, ntasks;
    int buf_size = 100;
    int *send_buffer;
    int *recv_buffer;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Allocate message */
    send_buffer = (int *)malloc(sizeof(int) * buf_size);
    recv_buffer = (int *)malloc(sizeof(int) * buf_size);

    /* Initialize buffers */
    for (i = 0; i < buf_size; i++) {
        send_buffer[i] = 9;
        recv_buffer[i] = 0;
    }

    /* Send and receive: rank 1 posts a receive that is one element
       too small, so the message is truncated */
    if (myid == 0) {
        MPI_CHECK( MPI_Send(send_buffer, buf_size, MPI_INT, 1, 11,
                            MPI_COMM_WORLD) );
    } else if (myid == 1) {
        MPI_CHECK( MPI_Recv(recv_buffer, buf_size - 1, MPI_INT, 0,
                            MPI_ANY_TAG, MPI_COMM_WORLD,
                            MPI_STATUS_IGNORE) );
        printf("Rank %i received %i\n", myid, recv_buffer[0]);
    }

    free(send_buffer);
    free(recv_buffer);
    MPI_Finalize();
    return 0;
}