Since PSM on master is still broken, I applied the patch to 1.8.4 instead. Unfortunately, it does not work: the error is the same as before.
Looking at your patch, I would also expect this to be the correct fix, and I even tried changing ompi_mtl_psm_cancel() to always return OMPI_SUCCESS. MPI_Cancel() still fails. Looking at the PSM code, it seems that PSM can call exit(-1) directly, terminating the process without ever returning to Open MPI. I do not see any debug output from Open MPI after "Cannot cancel send requests" from PSM.

                Adrian

On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> From the MPI standard perspective MPI_Cancel doesn't have to succeed, it
> can also gracefully fail. However, the PSM MTL diverges from the MPI
> standard and if a request cannot be canceled an error is returned. Here is
> a patch to fix this issue.
>
> diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c b/ompi/mca/mtl/psm/mtl_psm_cancel
> index 6da3386..277c761 100644
> --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
>      if(PSM_OK == err) {
>        mtl_request->ompi_req->req_status._cancelled = true;
>        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> -      return OMPI_SUCCESS;
> -    } else {
> -      return OMPI_ERROR;
>      }
> +    return OMPI_SUCCESS;
>    } else if(PSM_MQ_INCOMPLETE == err) {
>      return OMPI_SUCCESS;
>    }
>
>   George.
>
>
> On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
>
> > Doing
> >
> >         MPI_Isend()
> >
> > followed by a
> >
> >         MPI_Cancel()
> >
> > fails on my PSM based system with 1.8.4 like this:
> >
> > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > -------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun detected that one or more processes exited with non-zero status, thus causing
> > the job to be terminated. The first process to do so was:
> >
> >   Process name: [[58364,1],1]
> >   Exit code:    255
> > --------------------------------------------------------------------------
> >
> > Is this something PSM actually cannot do or an Open MPI error?
> >
> >                 Adrian
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> >
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16784.php