See my comment on https://github.com/open-mpi/ompi/issues/347

On Thu, Jan 15, 2015 at 05:01:00PM -0500, George Bosilca wrote:
> Skimming through the PSM code shows that the return values of the PSM
> functions are handled in most cases. Thus, removing the default error
> handler might not be such a bad idea.
> 
> Did you experience any trouble running with the version without the default
> error handler registered?
> 
>   George.
> 
> 
> On Thu, Jan 15, 2015 at 4:40 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > It even says so in the code:
> >
> > ompi/mca/mtl/psm/mtl_psm.c:
> >
> >        /* Default error handling is enabled, errors will not be returned to
> >          * user.  PSM prints the error and the offending endpoint's hostname
> >          * and exits with -1 */
> >
> > Disabling the default PSM error handler makes MPI_Cancel() fail
> > gracefully. But then no error is handled anymore.
> >
> >                 Adrian
> >
> > On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote:
> > > As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
> > > does not work. The error is the same as before.
> > >
> > > Looking at your patch I would also expect that this is the correct fix
> > > and I even tried to change ompi_mtl_psm_cancel() to always return
> > > OMPI_SUCCESS. MPI_Cancel() still fails.
> > >
> > > Looking at the PSM code, it seems it can call exit(-1) directly, thus
> > > terminating the process without ever returning to Open MPI. I do not see
> > > any debug output from Open MPI after "Cannot cancel send requests" from PSM.
> > >
> > >               Adrian
> > >
> > > On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> > > > From the MPI standard perspective MPI_Cancel doesn't have to succeed; it
> > > > can also gracefully fail. However, the PSM MTL diverges from the MPI
> > > > standard and if a request cannot be canceled an error is returned. Here is
> > > > a patch to fix this issue.
> > > >
> > > > diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c b/ompi/mca/mtl/psm/mtl_psm_cancel
> > > > index 6da3386..277c761 100644
> > > > --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
> > > >      if(PSM_OK == err) {
> > > >        mtl_request->ompi_req->req_status._cancelled = true;
> > > >        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> > > > -      return OMPI_SUCCESS;
> > > > -    } else {
> > > > -      return OMPI_ERROR;
> > > >      }
> > > > +    return OMPI_SUCCESS;
> > > >    } else if(PSM_MQ_INCOMPLETE == err) {
> > > >      return OMPI_SUCCESS;
> > > >    }
> > > >
> > > >   George.
> > > >
> > > >
> > > > On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> > > >
> > > > > Doing
> > > > >
> > > > > MPI_Isend()
> > > > >
> > > > > followed by a
> > > > >
> > > > > MPI_Cancel()
> > > > >
> > > > > fails on my PSM based system with 1.8.4 like this:
> > > > >
> > > > > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > > > > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > > > > -------------------------------------------------------
> > > > > Primary job  terminated normally, but 1 process returned
> > > > > a non-zero exit code.. Per user-direction, the job has been aborted.
> > > > > -------------------------------------------------------
> > > > >
> > --------------------------------------------------------------------------
> > > > > mpirun detected that one or more processes exited with non-zero
> > status,
> > > > > thus causing
> > > > > the job to be terminated. The first process to do so was:
> > > > >
> > > > >   Process name: [[58364,1],1]
> > > > >   Exit code:    255
> > > > >
> > --------------------------------------------------------------------------
> > > > >
> > > > > Is this something PSM actually cannot do or an Open MPI error?
> > > > >
> > > > >                 Adrian
> > > > > _______________________________________________
> > > > > devel mailing list
> > > > > de...@open-mpi.org
> > > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > > Link to this post:
> > > > > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
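
For readers following the thread, the change to the branch shown in George's
patch can be sketched in isolation. This is only a rough model under stated
assumptions, not the Open MPI source: the enum names and values below are
stand-ins rather than the real PSM or Open MPI constants, the completion
callback is left out, and the function names are hypothetical. It shows only
that the patched branch no longer turns a failed cancellation into an error
return, which matches the point that MPI_Cancel may fail gracefully.

```c
#include <stdbool.h>

/* Stand-in status codes. The values are hypothetical and are NOT taken
 * from the PSM or Open MPI headers. */
enum sketch_psm_err { SK_PSM_OK = 0, SK_PSM_OTHER = 1 };
enum sketch_ompi_rc { SK_OMPI_SUCCESS = 0, SK_OMPI_ERROR = -1 };

/* Branch as it looked before the patch: any status other than OK is
 * reported to the caller as an error. */
int sketch_branch_before_patch(enum sketch_psm_err err, bool *cancelled)
{
    if (SK_PSM_OK == err) {
        *cancelled = true;          /* request is marked as cancelled */
        return SK_OMPI_SUCCESS;
    } else {
        return SK_OMPI_ERROR;       /* failed cancel becomes an error */
    }
}

/* Branch as it looks after the patch: the cancelled flag is set only on
 * OK, but the return value is success either way. */
int sketch_branch_after_patch(enum sketch_psm_err err, bool *cancelled)
{
    if (SK_PSM_OK == err) {
        *cancelled = true;          /* request is marked as cancelled */
    }
    return SK_OMPI_SUCCESS;         /* failed cancel is not an error */
}
```

With a status other than OK, the first function returns the error code while
the second returns success and leaves the cancelled flag untouched. As the
later messages in this thread note, this return-value change has no effect if
PSM's default error handler exits the process before control returns to
Open MPI.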
> > > > >
> > >
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16784.php
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16786.php
