Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
George, Thanks for review and commit! I've confirmed your modification. Takahiro Kawashima, MPI development team, Fujitsu > Takahiro, > > Indeed we were way to lax on canceling the requests. I modified your patch to > correctly deal with the MEMCHECK macro (remove the call from the branch that > will requires a completion function). The modified patch is attached below. I > will commit asap. > > Thanks, > george. > > > Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c > === > --- ompi/mca/pml/ob1/pml_ob1_recvreq.c(revision 26870) > +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c(working copy) > @@ -3,7 +3,7 @@ > * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana > * University Research and Technology > * Corporation. All rights reserved. > - * Copyright (c) 2004-2009 The University of Tennessee and The University > + * Copyright (c) 2004-2012 The University of Tennessee and The University > * of Tennessee Research Foundation. All rights > * reserved. > * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, > @@ -15,6 +15,7 @@ > * Copyright (c) 2012 NVIDIA Corporation. All rights reserved. > * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights > * reserved. > + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved. > * $COPYRIGHT$ > * > * Additional copyrights may follow > @@ -97,36 +98,26 @@ > mca_pml_ob1_recv_request_t* request = > (mca_pml_ob1_recv_request_t*)ompi_request; > mca_pml_ob1_comm_t* comm = > request->req_recv.req_base.req_comm->c_pml_comm; > > -if( true == ompi_request->req_complete ) { /* way to late to cancel this > one */ > -/* > - * Receive request completed, make user buffer accessable. > - */ > -MEMCHECKER( > -memchecker_call(_memchecker_base_mem_defined, > -request->req_recv.req_base.req_addr, > -request->req_recv.req_base.req_count, > -request->req_recv.req_base.req_datatype); > -); > +if( true == request->req_match_received ) { /* way to late to cancel > this one */ > +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not > matched isn't it */ > return OMPI_SUCCESS; > } > > /* The rest should be protected behind the match logic lock */ > OPAL_THREAD_LOCK(>matching_lock); > -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match > has not been already done */ > - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > - opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > - } else { > - mca_pml_ob1_comm_proc_t* proc = comm->procs + > request->req_recv.req_base.req_peer; > - opal_list_remove_item(>specific_receives, > (opal_list_item_t*)request); > - } > - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, > -&(request->req_recv.req_base), PERUSE_RECV ); > - /** > -* As now the PML is done with this request we have to force the > pml_complete > -* to true. Otherwise, the request will never be freed. > -*/ > - request->req_recv.req_base.req_pml_complete = true; > +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > +opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > +} else { > +mca_pml_ob1_comm_proc_t* proc = comm->procs + > request->req_recv.req_base.req_peer; > +opal_list_remove_item(>specific_receives, > (opal_list_item_t*)request); > } > +PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, > + &(request->req_recv.req_base), PERUSE_RECV ); > +/** > + * As now the PML is done with this request we have to force the > pml_complete > + * to true. Otherwise, the request will never be freed. > + */ > +request->req_recv.req_base.req_pml_complete = true; > OPAL_THREAD_UNLOCK(>matching_lock); > > OPAL_THREAD_LOCK(_request_lock); > @@ -138,7 +129,7 @@ > MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request); > OPAL_THREAD_UNLOCK(_request_lock); > /* > - * Receive request cancelled, make user buffer accessable. > + * Receive request cancelled, make user buffer accessible. > */ > MEMCHECKER( > memchecker_call(_memchecker_base_mem_defined, > > On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote: > > > Hi Open MPI developers, > > > > I found a small bug in Open MPI. > > > > See attached program cancelled.c. > > In this program, rank 1 tries to cancel a MPI_Irecv and calls a MPI_Recv > > instead if the cancellation succeeds.
Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
OK, so this is only for receive, and not for send, I take it. Should have looked closer. Rich -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca Sent: Thursday, July 26, 2012 10:47 AM To: Open MPI Developers Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag Rich, There is no matching in this case. Canceling a receive operation is possible only up to the moment the request has been matched. Up to this point the sequence numbers of the peers are not used, so removing a non-matched request has no impact on the sequence number. george. On Jul 26, 2012, at 16:31 , Richard Graham wrote: > I do not see any resetting of sequence numbers. It has been a long time > since I have looked at the matching code, so don't know if the out-of-order > handling has been taken out. If not, the sequence number has to be dealt > with in some manner, or else there will be a gap in the arriving sequence > numbers, and the matching logic will prevent any further progress. > > Rich > > -Original Message- > From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] > On Behalf Of George Bosilca > Sent: Thursday, July 26, 2012 10:06 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a > request if it has a matched recv frag > > Takahiro, > > Indeed we were way to lax on canceling the requests. I modified your patch to > correctly deal with the MEMCHECK macro (remove the call from the branch that > will requires a completion function). The modified patch is attached below. I > will commit asap. > > Thanks, >george. > > > Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c > === > --- ompi/mca/pml/ob1/pml_ob1_recvreq.c(revision 26870) > +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c(working copy) > @@ -3,7 +3,7 @@ > * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana > * University Research and Technology > * Corporation. All rights reserved. > - * Copyright (c) 2004-2009 The University of Tennessee and The > University > + * Copyright (c) 2004-2012 The University of Tennessee and The > + University > * of Tennessee Research Foundation. All rights > * reserved. > * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, @@ > -15,6 +15,7 @@ > * Copyright (c) 2012 NVIDIA Corporation. All rights reserved. > * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights > * reserved. > + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved. > * $COPYRIGHT$ > * > * Additional copyrights may follow > @@ -97,36 +98,26 @@ > mca_pml_ob1_recv_request_t* request = > (mca_pml_ob1_recv_request_t*)ompi_request; > mca_pml_ob1_comm_t* comm = > request->req_recv.req_base.req_comm->c_pml_comm; > > -if( true == ompi_request->req_complete ) { /* way to late to cancel this > one */ > -/* > - * Receive request completed, make user buffer accessable. > - */ > -MEMCHECKER( > -memchecker_call(_memchecker_base_mem_defined, > -request->req_recv.req_base.req_addr, > -request->req_recv.req_base.req_count, > -request->req_recv.req_base.req_datatype); > -); > +if( true == request->req_match_received ) { /* way to late to cancel > this one */ > +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); > + /* not matched isn't it */ > return OMPI_SUCCESS; > } > > /* The rest should be protected behind the match logic lock */ > OPAL_THREAD_LOCK(>matching_lock); > -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match > has not been already done */ > - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > - opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > - } else { > - mca_pml_ob1_comm_proc_t* proc = comm->procs + > request->req_recv.req_base.req_peer; > - opal_list_remove_item(>specific_receives, > (opal_list_item_t*)request); > - } > - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, > -&(request->req_recv.req_base), PERUSE_RECV ); > - /** > -* As now the PML is done with this request we have to force the > pml_complete > -* to true. Otherwise, the request will never be freed. > -*/ > - request->req_recv.req_base.req_pml_complete = true; > +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > +opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > +} else { > +
Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
Rich, There is no matching in this case. Canceling a receive operation is possible only up to the moment the request has been matched. Up to this point the sequence numbers of the peers are not used, so removing a non-matched request has no impact on the sequence number. george. On Jul 26, 2012, at 16:31 , Richard Graham wrote: > I do not see any resetting of sequence numbers. It has been a long time > since I have looked at the matching code, so don't know if the out-of-order > handling has been taken out. If not, the sequence number has to be dealt > with in some manner, or else there will be a gap in the arriving sequence > numbers, and the matching logic will prevent any further progress. > > Rich > > -Original Message- > From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On > Behalf Of George Bosilca > Sent: Thursday, July 26, 2012 10:06 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if > it has a matched recv frag > > Takahiro, > > Indeed we were way to lax on canceling the requests. I modified your patch to > correctly deal with the MEMCHECK macro (remove the call from the branch that > will requires a completion function). The modified patch is attached below. I > will commit asap. > > Thanks, >george. > > > Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c > === > --- ompi/mca/pml/ob1/pml_ob1_recvreq.c(revision 26870) > +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c(working copy) > @@ -3,7 +3,7 @@ > * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana > * University Research and Technology > * Corporation. All rights reserved. > - * Copyright (c) 2004-2009 The University of Tennessee and The University > + * Copyright (c) 2004-2012 The University of Tennessee and The > + University > * of Tennessee Research Foundation. All rights > * reserved. > * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, @@ > -15,6 +15,7 @@ > * Copyright (c) 2012 NVIDIA Corporation. All rights reserved. > * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights > * reserved. > + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved. > * $COPYRIGHT$ > * > * Additional copyrights may follow > @@ -97,36 +98,26 @@ > mca_pml_ob1_recv_request_t* request = > (mca_pml_ob1_recv_request_t*)ompi_request; > mca_pml_ob1_comm_t* comm = > request->req_recv.req_base.req_comm->c_pml_comm; > > -if( true == ompi_request->req_complete ) { /* way to late to cancel this > one */ > -/* > - * Receive request completed, make user buffer accessable. > - */ > -MEMCHECKER( > -memchecker_call(_memchecker_base_mem_defined, > -request->req_recv.req_base.req_addr, > -request->req_recv.req_base.req_count, > -request->req_recv.req_base.req_datatype); > -); > +if( true == request->req_match_received ) { /* way to late to cancel > this one */ > +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* > + not matched isn't it */ > return OMPI_SUCCESS; > } > > /* The rest should be protected behind the match logic lock */ > OPAL_THREAD_LOCK(>matching_lock); > -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match > has not been already done */ > - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > - opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > - } else { > - mca_pml_ob1_comm_proc_t* proc = comm->procs + > request->req_recv.req_base.req_peer; > - opal_list_remove_item(>specific_receives, > (opal_list_item_t*)request); > - } > - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, > -&(request->req_recv.req_base), PERUSE_RECV ); > - /** > -* As now the PML is done with this request we have to force the > pml_complete > -* to true. Otherwise, the request will never be freed. > -*/ > - request->req_recv.req_base.req_pml_complete = true; > +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { > +opal_list_remove_item( >wild_receives, > (opal_list_item_t*)request ); > +} else { > +mca_pml_ob1_comm_proc_t* proc = comm->procs + > request->req_recv.req_base.req_peer; > +opal_list_remove_item(>specific_receives, > + (opal_list_item_t*)request); > } > +PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, > + &(request->req_recv.req_base), PERUSE_RECV ); > +/** > + * As now the PML is done with this request we
Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
I do not see any resetting of sequence numbers. It has been a long time since I have looked at the matching code, so don't know if the out-of-order handling has been taken out. If not, the sequence number has to be dealt with in some manner, or else there will be a gap in the arriving sequence numbers, and the matching logic will prevent any further progress. Rich -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca Sent: Thursday, July 26, 2012 10:06 AM To: Open MPI Developers Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag Takahiro, Indeed we were way to lax on canceling the requests. I modified your patch to correctly deal with the MEMCHECK macro (remove the call from the branch that will requires a completion function). The modified patch is attached below. I will commit asap. Thanks, george. Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c === --- ompi/mca/pml/ob1/pml_ob1_recvreq.c (revision 26870) +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c (working copy) @@ -3,7 +3,7 @@ * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. - * Copyright (c) 2004-2009 The University of Tennessee and The University + * Copyright (c) 2004-2012 The University of Tennessee and The + University * of Tennessee Research Foundation. All rights * reserved. * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, @@ -15,6 +15,7 @@ * Copyright (c) 2012 NVIDIA Corporation. All rights reserved. * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights * reserved. + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -97,36 +98,26 @@ mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request; mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm; -if( true == ompi_request->req_complete ) { /* way to late to cancel this one */ -/* - * Receive request completed, make user buffer accessable. - */ -MEMCHECKER( -memchecker_call(_memchecker_base_mem_defined, -request->req_recv.req_base.req_addr, -request->req_recv.req_base.req_count, -request->req_recv.req_base.req_datatype); -); +if( true == request->req_match_received ) { /* way to late to cancel this one */ +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* + not matched isn't it */ return OMPI_SUCCESS; } /* The rest should be protected behind the match logic lock */ OPAL_THREAD_LOCK(>matching_lock); -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */ - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { - opal_list_remove_item( >wild_receives, (opal_list_item_t*)request ); - } else { - mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer; - opal_list_remove_item(>specific_receives, (opal_list_item_t*)request); - } - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, -&(request->req_recv.req_base), PERUSE_RECV ); - /** -* As now the PML is done with this request we have to force the pml_complete -* to true. Otherwise, the request will never be freed. -*/ - request->req_recv.req_base.req_pml_complete = true; +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { +opal_list_remove_item( >wild_receives, (opal_list_item_t*)request ); +} else { +mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer; +opal_list_remove_item(>specific_receives, + (opal_list_item_t*)request); } +PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, + &(request->req_recv.req_base), PERUSE_RECV ); +/** + * As now the PML is done with this request we have to force the pml_complete + * to true. Otherwise, the request will never be freed. + */ +request->req_recv.req_base.req_pml_complete = true; OPAL_THREAD_UNLOCK(>matching_lock); OPAL_THREAD_LOCK(_request_lock); @@ -138,7 +129,7 @@ MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request); OPAL_THREAD_UNLOCK(_request_lock); /* - * Receive request cancelled, make user buffer accessable. + * Receive request cancelled, make user buffer accessible. */ MEMCHECKER(
Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
Takahiro, Indeed we were way to lax on canceling the requests. I modified your patch to correctly deal with the MEMCHECK macro (remove the call from the branch that will requires a completion function). The modified patch is attached below. I will commit asap. Thanks, george. Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c === --- ompi/mca/pml/ob1/pml_ob1_recvreq.c (revision 26870) +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c (working copy) @@ -3,7 +3,7 @@ * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. - * Copyright (c) 2004-2009 The University of Tennessee and The University + * Copyright (c) 2004-2012 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, @@ -15,6 +15,7 @@ * Copyright (c) 2012 NVIDIA Corporation. All rights reserved. * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights * reserved. + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -97,36 +98,26 @@ mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request; mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm; -if( true == ompi_request->req_complete ) { /* way to late to cancel this one */ -/* - * Receive request completed, make user buffer accessable. - */ -MEMCHECKER( -memchecker_call(_memchecker_base_mem_defined, -request->req_recv.req_base.req_addr, -request->req_recv.req_base.req_count, -request->req_recv.req_base.req_datatype); -); +if( true == request->req_match_received ) { /* way to late to cancel this one */ +assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not matched isn't it */ return OMPI_SUCCESS; } /* The rest should be protected behind the match logic lock */ OPAL_THREAD_LOCK(>matching_lock); -if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */ - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { - opal_list_remove_item( >wild_receives, (opal_list_item_t*)request ); - } else { - mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer; - opal_list_remove_item(>specific_receives, (opal_list_item_t*)request); - } - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, -&(request->req_recv.req_base), PERUSE_RECV ); - /** -* As now the PML is done with this request we have to force the pml_complete -* to true. Otherwise, the request will never be freed. -*/ - request->req_recv.req_base.req_pml_complete = true; +if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) { +opal_list_remove_item( >wild_receives, (opal_list_item_t*)request ); +} else { +mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer; +opal_list_remove_item(>specific_receives, (opal_list_item_t*)request); } +PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, + &(request->req_recv.req_base), PERUSE_RECV ); +/** + * As now the PML is done with this request we have to force the pml_complete + * to true. Otherwise, the request will never be freed. + */ +request->req_recv.req_base.req_pml_complete = true; OPAL_THREAD_UNLOCK(>matching_lock); OPAL_THREAD_LOCK(_request_lock); @@ -138,7 +129,7 @@ MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request); OPAL_THREAD_UNLOCK(_request_lock); /* - * Receive request cancelled, make user buffer accessable. + * Receive request cancelled, make user buffer accessible. */ MEMCHECKER( memchecker_call(_memchecker_base_mem_defined, On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote: > Hi Open MPI developers, > > I found a small bug in Open MPI. > > See attached program cancelled.c. > In this program, rank 1 tries to cancel a MPI_Irecv and calls a MPI_Recv > instead if the cancellation succeeds. This program should terminate whether > the cancellation succeeds or not. But it leads a deadlock in MPI_Recv after > printing "MPI_Test_cancelled: 1". > I confirmed it works fine with MPICH2. > > The problem is in mca_pml_ob1_recv_request_cancel function in > ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless > the request has been completed. I think it