Since PSM on master is still broken, I applied the patch to 1.8.4.
Unfortunately it does not help; the error is the same as before.

Looking at your patch, I would also expect it to be the correct fix.
I even tried changing ompi_mtl_psm_cancel() to always return
OMPI_SUCCESS, but MPI_Cancel() still fails.

Looking at the PSM code, it seems that it can call exit(-1) directly,
terminating the process without ever returning to Open MPI. I do not see
any debug output from Open MPI after the "Cannot cancel send requests"
message from PSM.
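
To illustrate why changing the return value cannot help if the library
exits on its own, here is a minimal sketch that uses neither MPI nor PSM.
The names lib_cancel(), wrapper_cancel(), returning_cancel() and
exit_status_of() are made up for this example; they only stand in for the
PSM cancel path and a patched ompi_mtl_psm_cancel(). It assumes a POSIX
system. It also shows that exit(-1) is reported to the parent as exit
status 255, which matches the exit code mpirun prints.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_SUCCESS 0 /* hypothetical stand-in for OMPI_SUCCESS */

/* Hypothetical stand-in for the PSM cancel path: it prints a message
 * and terminates the process instead of returning an error code. */
int lib_cancel(void)
{
    fprintf(stderr, "Cannot cancel send requests\n");
    exit(-1);
}

/* Hypothetical stand-in for a cancel wrapper patched to always report
 * success. The lines after lib_cancel() are never reached. */
int wrapper_cancel(void)
{
    (void)lib_cancel();
    fprintf(stderr, "wrapper: returning success\n"); /* never printed */
    return SKETCH_SUCCESS;
}

/* For contrast: a cancel function that does return to its caller. */
int returning_cancel(void)
{
    return SKETCH_SUCCESS;
}

/* Run fn in a forked child and return the child's exit status as seen
 * by the parent, or -1 if fork/waitpid fails or the child did not exit
 * normally. */
int exit_status_of(int (*fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: only reached past fn() if fn() actually returns. */
        int rc = fn();
        _exit(rc == SKETCH_SUCCESS ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling exit_status_of(wrapper_cancel) yields 255 even though the wrapper
is written to return success, while exit_status_of(returning_cancel)
yields 0.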

                Adrian

On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> From the MPI standard's perspective, MPI_Cancel doesn't have to succeed;
> it can also fail gracefully. However, the PSM MTL diverges from the MPI
> standard: if a request cannot be canceled, an error is returned. Here is
> a patch to fix this issue.
> 
> diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> b/ompi/mca/mtl/psm/mtl_psm_cancel
> index 6da3386..277c761 100644
> --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
>      if(PSM_OK == err) {
>        mtl_request->ompi_req->req_status._cancelled = true;
>        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> -      return OMPI_SUCCESS;
> -    } else {
> -      return OMPI_ERROR;
>      }
> +    return OMPI_SUCCESS;
>    } else if(PSM_MQ_INCOMPLETE == err) {
>      return OMPI_SUCCESS;
>    }
> 
>   George.
> 
> 
> On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Doing
> >
> > MPI_Isend()
> >
> > followed by a
> >
> > MPI_Cancel()
> >
> > fails on my PSM based system with 1.8.4 like this:
> >
> > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > -------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun detected that one or more processes exited with non-zero status,
> > thus causing
> > the job to be terminated. The first process to do so was:
> >
> >   Process name: [[58364,1],1]
> >   Exit code:    255
> > --------------------------------------------------------------------------
> >
> > Is this something PSM actually cannot do or an Open MPI error?
> >
> >                 Adrian
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> >

