Takahiro,

Indeed, we were way too lax in canceling the requests. I modified your patch to deal correctly with the MEMCHECKER macro (removing the call from the branch that requires a completion function). The modified patch is attached below. I will commit asap.
Thanks,
  george.

Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_recvreq.c	(revision 26870)
+++ ompi/mca/pml/ob1/pml_ob1_recvreq.c	(working copy)
@@ -3,7 +3,7 @@
  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
  *                         University Research and Technology
  *                         Corporation. All rights reserved.
- * Copyright (c) 2004-2009 The University of Tennessee and The University
+ * Copyright (c) 2004-2012 The University of Tennessee and The University
  *                         of Tennessee Research Foundation. All rights
  *                         reserved.
  * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
@@ -15,6 +15,7 @@
  * Copyright (c) 2012      NVIDIA Corporation. All rights reserved.
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2012      FUJITSU LIMITED. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -97,36 +98,26 @@
     mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request;
     mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm;
 
-    if( true == ompi_request->req_complete ) { /* way to late to cancel this one */
-        /*
-         * Receive request completed, make user buffer accessable.
-         */
-        MEMCHECKER(
-            memchecker_call(&opal_memchecker_base_mem_defined,
-                            request->req_recv.req_base.req_addr,
-                            request->req_recv.req_base.req_count,
-                            request->req_recv.req_base.req_datatype);
-        );
+    if( true == request->req_match_received ) { /* way to late to cancel this one */
+        assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not matched isn't it */
         return OMPI_SUCCESS;
     }
 
     /* The rest should be protected behind the match logic lock */
     OPAL_THREAD_LOCK(&comm->matching_lock);
-    if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */
-        if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
-            opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
-        } else {
-            mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
-            opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
-        }
-        PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
-                                 &(request->req_recv.req_base), PERUSE_RECV );
-        /**
-         * As now the PML is done with this request we have to force the pml_complete
-         * to true. Otherwise, the request will never be freed.
-         */
-        request->req_recv.req_base.req_pml_complete = true;
+    if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
+        opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
+    } else {
+        mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
+        opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
     }
+    PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
+                             &(request->req_recv.req_base), PERUSE_RECV );
+    /**
+     * As now the PML is done with this request we have to force the pml_complete
+     * to true. Otherwise, the request will never be freed.
+     */
+    request->req_recv.req_base.req_pml_complete = true;
     OPAL_THREAD_UNLOCK(&comm->matching_lock);
 
     OPAL_THREAD_LOCK(&ompi_request_lock);
@@ -138,7 +129,7 @@
     MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
     OPAL_THREAD_UNLOCK(&ompi_request_lock);
     /*
-     * Receive request cancelled, make user buffer accessable.
+     * Receive request cancelled, make user buffer accessible.
      */
     MEMCHECKER(
         memchecker_call(&opal_memchecker_base_mem_defined,

On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote:

> Hi Open MPI developers,
>
> I found a small bug in Open MPI.
>
> See the attached program cancelled.c. In this program, rank 1 tries to
> cancel an MPI_Irecv and calls MPI_Recv instead if the cancellation
> succeeds. The program should terminate whether the cancellation succeeds
> or not, but instead it deadlocks in MPI_Recv after printing
> "MPI_Test_cancelled: 1". I confirmed that it works fine with MPICH2.
>
> The problem is in the mca_pml_ob1_recv_request_cancel function in
> ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
> the request has already completed. I think it should not accept the
> cancellation once the request has been matched. If it wants to accept
> the cancellation anyway, it must push the recv frag back onto the
> unexpected message queue and redo the matching. Furthermore, the receive
> buffer must be reverted if the message has already been partially
> written into it, as in a pipeline protocol.
>
> The attached patch cancel-recv.patch is a sample fix for this bug
> against the Open MPI trunk. Though the patch has 65 lines, the main
> modifications are adding one if-statement and deleting one if-statement;
> the other lines only adjust indentation. I cannot confirm that the
> MEMCHECKER part is correct. Could anyone review it before committing?
>
> Regards,
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> <cancelled.c><cancel-recv.patch><License.txt>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
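
Since the cancelled.c attachment is not reproduced above, here is a minimal sketch of a reproducer matching the description in the report; the exact ranks, tag, message size, and the sleep-based timing are assumptions, not the original program:

/*
 * Rank 0 sends a message; rank 1 posts a matching MPI_Irecv, waits long
 * enough for the message to arrive and be matched, then tries to cancel
 * the receive.  If MPI_Test_cancelled reports success, the message must
 * still be pending, so rank 1 receives it for real with MPI_Recv.
 */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0, cancelled = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        sleep(1);  /* give the message time to arrive and be matched */
        MPI_Cancel(&req);
        MPI_Wait(&req, &status);
        MPI_Test_cancelled(&status, &cancelled);
        printf("MPI_Test_cancelled: %d\n", cancelled);
        if (cancelled) {
            /* The receive was cancelled, so the message should still be
             * unexpected; receive it for real.  This is the call that
             * hangs when the cancel is wrongly accepted after matching. */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
    }

    MPI_Finalize();
    return 0;
}

Run with two processes (e.g. mpirun -np 2 ./cancelled). Before the fix, rank 1 prints "MPI_Test_cancelled: 1" and then hangs in MPI_Recv, because the message had already been matched to the cancelled request and is gone from the unexpected queue.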