Hi,

a colleague of mine noticed that the ucx pml error handling doesn't work
as expected when a message is truncated in an MPI_Recv call. The
truncation error is returned in the error field of the MPI status, while
the receive call itself returns MPI_SUCCESS. This means that it's not
possible to catch errors in receive calls when using MPI_STATUS_IGNORE.

If I understand the MPI standard correctly, this is not allowed behavior.
Only routines that wait for or test the completion of several messages
are allowed to modify the error field, and when the error field is used,
the return code of the call should be MPI_ERR_IN_STATUS.

I understand that programs should not rely on the error status and
should use MPI_Probe if there is a risk that messages are truncated,
but the current behavior can make debugging quite frustrating. For
example, running the attached truncation example with the ucx pml
doesn't raise any errors, but the result is incorrect:

  $ mpiexec -mca pml ucx -np 2 ./truncate
  Rank 1 received 0

Running with ob1 triggers the default error handling:

  $ mpiexec -mca pml ob1 -np 2 ./truncate
   *** An error occurred in MPI_Recv
   *** reported by process [584581121,1]
   *** on communicator MPI_COMM_WORLD
   *** MPI_ERR_TRUNCATE: message truncated
   *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
   ***    and potentially your MPI job)

I've tested this using versions 4.0.2a1, 4.0.3, and 4.0.4. After
checking pml_ucx.c, I assume this is a systematic issue in how the
receive calls are implemented, not a single isolated bug.

Best regards,
Sami Ilvonen

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Evaluate the call once, abort on failure. The do-while wrapper
   makes the macro safe to use as a single statement. */
#define MPI_CHECK(call)                                     \
    do {                                                    \
        int mpi_check_rc_ = (call);                         \
        if (mpi_check_rc_ != MPI_SUCCESS) {                 \
            fprintf(stderr, "MPI error in %s at line %i\n", \
                    __FILE__, __LINE__);                    \
            MPI_Abort(MPI_COMM_WORLD, mpi_check_rc_);       \
        }                                                   \
    } while (0)

int main(int argc, char *argv[])
{
    int i, myid, ntasks;
    int buf_size = 100;
    int *send_buffer;
    int *recv_buffer;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Allocate message */
    send_buffer = (int *)malloc(sizeof(int) * buf_size);
    recv_buffer = (int *)malloc(sizeof(int) * buf_size);

    /* Initialize buffers */
    for (i = 0; i < buf_size; i++) {
        send_buffer[i] = 9;
        recv_buffer[i] = 0;
    }

    /* Send and receive */
    if (myid == 0) {
        MPI_CHECK( MPI_Send(send_buffer, buf_size, MPI_INT, 1, 11,
                            MPI_COMM_WORLD) );
    } else if (myid == 1) {
        /* Receive buffer is deliberately one element too small,
           so the receive should fail with MPI_ERR_TRUNCATE */
        MPI_CHECK( MPI_Recv(recv_buffer, buf_size - 1, MPI_INT, 0,
                            MPI_ANY_TAG, MPI_COMM_WORLD,
                            MPI_STATUS_IGNORE) );
        printf("Rank %i received %i\n", myid, recv_buffer[0]);
    }

    free(send_buffer);
    free(recv_buffer);
    MPI_Finalize();
    return 0;
}
