Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-16 Thread Adrian Reber
See my comment on https://github.com/open-mpi/ompi/issues/347

On Thu, Jan 15, 2015 at 05:01:00PM -0500, George Bosilca wrote:
> Skimming through the PSM code shows that the return values of the PSM
> functions are handled in most cases. Thus, removing the default error
> handler might not be such a bad idea.
> 
> Did you experience any trouble running with the version without the default
> error handler registered?
> 
>   George.


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
It even says so in the code:

ompi/mca/mtl/psm/mtl_psm.c:

    /* Default error handling is enabled, errors will not be returned to
     * user.  PSM prints the error and the offending endpoint's hostname
     * and exits with -1 */

Disabling the default PSM error handler makes MPI_Cancel() fail
gracefully. But then no error is handled anymore.

Adrian
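
A minimal sketch of the trade-off described above, in plain C with no PSM dependency. Every name in it (lib_cancel, lib_set_error_handler, handler_exit, handler_noop, LIB_ERR_CANNOT_CANCEL) is hypothetical and only models the behaviour described in the quoted comment; it is not the PSM API. With the exiting handler installed, the caller never sees a return value. With the no-op handler, the error comes back and the caller has to check it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; these are not the real PSM error codes. */
typedef enum { LIB_OK = 0, LIB_ERR_CANNOT_CANCEL = 1 } lib_error_t;

/* A handler receives the error and returns the value that is handed
 * back to the caller, unless it terminates the process first. */
typedef lib_error_t (*lib_error_handler_t)(lib_error_t err, const char *msg);

/* Models the default behaviour: print the error and exit with -1. */
static lib_error_t handler_exit(lib_error_t err, const char *msg)
{
    (void)err;
    fprintf(stderr, "%s\n", msg);
    exit(-1);
}

/* Models a no-op handler: pass the error back to the caller unchanged. */
static lib_error_t handler_noop(lib_error_t err, const char *msg)
{
    (void)msg;
    return err;
}

/* The exiting handler is installed by default in this model. */
static lib_error_handler_t current_handler = handler_exit;

static void lib_set_error_handler(lib_error_handler_t h)
{
    assert(h != NULL);
    current_handler = h;
}

/* Models a cancel call that cannot cancel send requests. */
static lib_error_t lib_cancel(int is_send)
{
    if (is_send) {
        return current_handler(LIB_ERR_CANNOT_CANCEL,
                               "Cannot cancel send requests");
    }
    return LIB_OK;
}
```

After calling lib_set_error_handler(handler_noop), lib_cancel(1) returns LIB_ERR_CANNOT_CANCEL instead of ending the process, so every call site then needs its own return-value check.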


-- 
Adrian Reber http://lisas.de/~adrian/
C-3PO: 
Don't call me a mindless philosopher, you overweight
glob of grease!


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
Since PSM on master is still broken, I applied the patch to 1.8.4.
Unfortunately, it does not work. The error is the same as before.

Looking at your patch, I would also expect it to be the correct fix,
and I even tried changing ompi_mtl_psm_cancel() to always return
OMPI_SUCCESS. MPI_Cancel() still fails.

Looking at the PSM code, it seems PSM can call exit(-1) directly, which
terminates the process without ever returning to Open MPI. I do not see
any debug output from Open MPI after "Cannot cancel send requests" from PSM.

Adrian
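
A small sketch of the return-code mapping that the patch quoted below aims for, written as a standalone C model. The enums and the function cancel_rc are hypothetical stand-ins, not Open MPI or PSM symbols. The outer condition and the final error return are not visible in the quoted hunk, so they are assumptions here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PSM and Open MPI status codes. */
typedef enum { M_OK, M_INCOMPLETE, M_OTHER_ERROR } m_status_t;
typedef enum { RC_SUCCESS = 0, RC_ERROR = -1 } rc_t;

/* Models the control flow after the patch.
 * cancel_err: assumed result of the cancel call (outer condition).
 * test_err:   assumed result of the follow-up check (inner condition).
 * cancelled:  set to true only if the request really was cancelled. */
static rc_t cancel_rc(m_status_t cancel_err, m_status_t test_err,
                      bool *cancelled)
{
    assert(cancelled != NULL);
    *cancelled = false;

    if (M_OK == cancel_err) {
        if (M_OK == test_err) {
            *cancelled = true;  /* mark the request as cancelled */
        }
        return RC_SUCCESS;      /* after the patch: success either way */
    } else if (M_INCOMPLETE == cancel_err) {
        return RC_SUCCESS;      /* not cancelled, but a graceful failure */
    }
    return RC_ERROR;            /* assumed fall-through for other errors */
}
```

This mapping only takes effect if the underlying cancel call returns at all. If the library exits inside that call, as observed above, none of these branches is reached, which would explain why changing the return value made no difference.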

On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> From the MPI standard's perspective, MPI_Cancel doesn't have to succeed; it
> can also fail gracefully. However, the PSM MTL diverges from the MPI
> standard: if a request cannot be canceled, an error is returned. Here is
> a patch to fix this issue.
> 
> diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c b/ompi/mca/mtl/psm/mtl_psm_cancel
> index 6da3386..277c761 100644
> --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
>      if(PSM_OK == err) {
>        mtl_request->ompi_req->req_status._cancelled = true;
>        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> -      return OMPI_SUCCESS;
> -    } else {
> -      return OMPI_ERROR;
>      }
> +    return OMPI_SUCCESS;
>    } else if(PSM_MQ_INCOMPLETE == err) {
>      return OMPI_SUCCESS;
>    }
> 
>   George.


[OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
Doing

MPI_Isend()

followed by

MPI_Cancel()

fails on my PSM-based system with Open MPI 1.8.4 like this:

n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58364,1],1]
  Exit code:255
--

Is this something PSM actually cannot do or an Open MPI error?

Adrian