Re: [OMPI devel] 1.10.3rc MTT failures

2016-04-25 Thread Adrian Reber
Errors like that (Win::Get_attr: Got wrong value for disp unit) are from
my ppc64 machine: https://mtt.open-mpi.org/index.php?do_redir=2295

The MTT setup is checking out the tests from github directly:

[Test get: ibm]
module = SCM
scm_module = Git
scm_url = https://github.com/open-mpi/ompi-tests.git
scm_subdir = ibm

Not sure Ralph meant those errors. But they only happen on ppc64 and not
on x86_64 with a very similar mtt configuration file.
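
For reference, the failing check is essentially querying the window for its
MPI_WIN_DISP_UNIT attribute; a rough C analogue of what the cxx_win_attr test
does (an illustrative sketch, not the actual test source) is:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Win win;
        int buf[4], *disp_unit, flag;

        MPI_Init(&argc, &argv);
        MPI_Win_create(buf, sizeof(buf), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* per the standard, the attribute comes back as a pointer to the
         * stored int, so it has to be dereferenced */
        MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag);
        if (flag && *disp_unit != (int) sizeof(int))
            printf("got wrong value for disp unit: %d\n", *disp_unit);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }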

Adrian

On Mon, Apr 25, 2016 at 10:50:03PM +0900, Gilles Gouaillardet wrote:
> Cisco mtt looks clean
> since ompi_tests repo is private, it cannot be automatically pulled unless
> a password is saved (https) or a public key was uploaded to github (ssh).
> For that reason, I would not simply assume the latest test suite is used :-(
> and fwiw, Jeff uses an internally mirrored repo for ompi-tests, so the Cisco
> clusters should be using the latest test suite.
> 
> Geoffrey,
> can you please comment on the config of the ibm cluster ?
> 
> Cheers,
> 
> Gilles
> 
> On Monday, April 25, 2016, Ralph Castain wrote:
> 
> > I don’t know - this isn’t on my machine, but rather in the weekend and
> > nightly MTT reports. I’m assuming folks are running the latest test suite,
> > but...
> >
> >
> > On Apr 25, 2016, at 6:20 AM, Gilles Gouaillardet <
> > gilles.gouaillar...@gmail.com> wrote:
> >
> > Ralph,
> >
> > can you make sure the ibm test suite is up to date ?
> > I pushed a fix for datatypes a few days ago, and it should be fine now.
> >
> > I will double check this tomorrow anyway
> >
> > Cheers,
> >
> > Gilles
> >
> > On Monday, April 25, 2016, Ralph Castain  wrote:
> >
> >> I’m seeing some consistent errors in the 1.10.3rc MTT results and would
> >> appreciate it if folks could check them out:
> >>
> >> ONESIDED:
> >> onesided/cxx_win_attr:
> >> [**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
> >> Win::Get_attr: Got wrong value for disp unit
> >> [**ERROR**]: MPI_COMM_WORLD rank 1, file cxx_win_attr.cc:50:
> >> Win::Get_attr: Got wrong value for disp
> >>
> >>
> >> DATATYPE:
> >> datatype/predefined-datatype-name
> >> MPI_LONG_LONG != MPI_LONG_LONG_INT
> >>
> >>
> >> LOOP SPAWN:
> >> too many retries sending message to , giving up
> >>
> >> Thanks
> >> Ralph
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2016/04/18809.php
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2016/04/18810.php
> >
> >
> >

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18812.php


[OMPI devel] Segmentation fault in opal_fifo (MTT)

2016-03-01 Thread Adrian Reber
I have seen it before but it was not reproducible. I have now two
segfaults in opal_fifo in today's MTT run on master and 2.x:


https://mtt.open-mpi.org/index.php?do_redir=2270
https://mtt.open-mpi.org/index.php?do_redir=2271

The thing that is strange about the MTT output is that MTT does not detect
the endianness and bitness correctly. It says on an x86_64 (Fedora 23)
system:

Endian: unknown
Bitness: 32

Endianness is not mentioned in the mtt configuration file and bitness is
commented out like this:

#CN: bitness = 32

which is probably something I copied from another mtt configuration file
when initially creating mine.

Adrian


[OMPI devel] MTT setup updated to gcc-6.0 (pre)

2016-02-25 Thread Adrian Reber
I installed a pre-release gcc 6.0

 gcc version 6.0.0 20160221 (experimental) (GCC)

on my MTT systems (ppc64 and x86_64) and I now get a
test build failure:

https://mtt.open-mpi.org/index.php?do_redir=2269

Just as a FYI.

Adrian


Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Adrian Reber
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on the possibility of checkpointing a process and
> restarting it on another node. We are using the CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
> 
> First of all: we found out that some attempts to C/R an OMPI process have
> already been made in the past. Is anything related to that still
> supported/available/working?

I was working on the CRIU <-> OpenMPI integration during 2013/2014. The
code is still available at:

https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

http://lisas.de/~adrian/?p=926

From what I have heard/read, Open MPI has probably had enough internal
changes that the Fault Tolerance framework, which is needed to use the
checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon
to start the checkpoint. This service daemon no longer exists due to
security concerns:

https://lwn.net/Articles/658070/

So you either need to call the criu binary directly or you can use 'criu
swrk'.

Restore should be easier as criu now supports the option --inherit-fd
which should help to correctly re-route stdin/stdout/stderr.

Adrian


Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
After lots of make cleans it works again. Thanks.

On Wed, Sep 09, 2015 at 10:00:10AM +, Jeff Squyres (jsquyres) wrote:
> Try making clean (perhaps just in ompi/coll/ml) and trying again -- this 
> looks like it could just be a stale file in your tree.
> 
> > On Sep 9, 2015, at 5:41 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > I was about to try Gilles' patch but the current master checkout does
> > not build on my ppc64 system: (b79cffc73b88c2e5e2f2161e096c49aed5b9d2ed)
> > 
> > Making all in mca/coll/ml
> > make[2]: Entering directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
> > /bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -g 
> > -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes 
> > -Wstrict-prototypes -Wcomment -pedantic 
> > -Werror-implicit-function-declaration -finline-functions 
> > -fno-strict-aliasing -pthread -module -avoid-version  -o mca_coll_ml.la 
> > -rpath /tmp/ompi/lib/openmpi coll_ml_module.lo coll_ml_allocation.lo 
> > coll_ml_barrier.lo coll_ml_bcast.lo coll_ml_component.lo 
> > coll_ml_copy_fns.lo coll_ml_descriptors.lo coll_ml_hier_algorithms.lo 
> > coll_ml_hier_algorithms_setup.lo coll_ml_hier_algorithms_bcast_setup.lo 
> > coll_ml_hier_algorithms_allreduce_setup.lo 
> > coll_ml_hier_algorithms_reduce_setup.lo 
> > coll_ml_hier_algorithms_common_setup.lo 
> > coll_ml_hier_algorithms_allgather_setup.lo 
> > coll_ml_hier_algorithm_memsync_setup.lo coll_ml_custom_utils.lo 
> > coll_ml_progress.lo coll_ml_reduce.lo coll_ml_allreduce.lo 
> > coll_ml_allgather.lo coll_ml_mca.lo coll_ml_lmngr.lo 
> > coll_ml_hier_algorithms_barrier_setup.lo coll_ml_select.lo coll_ml_memsync.lo
> > coll_ml_lex.lo coll_ml_config.lo  -lrt  -lm -lutil   -lm -lutil  
> > libtool: link: `coll_ml_bcast.lo' is not a valid libtool object
> > Makefile:1860: recipe for target 'mca_coll_ml.la' failed
> > make[2]: *** [mca_coll_ml.la] Error 1
> > make[2]: Leaving directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
> > Makefile:3366: recipe for target 'all-recursive' failed
> > 
> > 
> > 
> > 
> > On Tue, Sep 08, 2015 at 05:19:56PM +, Jeff Squyres (jsquyres) wrote:
> >> Thanks Adrian; I turned this into 
> >> https://github.com/open-mpi/ompi/issues/874.
> >> 
> >>> On Sep 8, 2015, at 9:56 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>> Since a few days the MTT runs on my ppc64 systems are failing with:
> >>> 
> >>> [bimini:11716] *** Process received signal ***
> >>> [bimini:11716] Signal: Segmentation fault (11)
> >>> [bimini:11716] Signal code: Address not mapped (1)
> >>> [bimini:11716] Failing at address: (nil)[bimini:11716] [ 0] 
> >>> [0x3fffa2bb0448]
> >>> [bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074] 
> >>> [bimini:11716] [ 2]
> >>> /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
> >>>  [bimini:11716] [ 3]
> >>> /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
> >>>  [bimini:11716] [ 4]
> >>> /home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]
> >>> 
> >>> I think I do not see these kind of errors on any of the other MTT setups
> >>> so it might be ppc64 related. Just wanted to point it out.
> >>> 
> >>>   Adrian


Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
I was about to try Gilles' patch but the current master checkout does
not build on my ppc64 system: (b79cffc73b88c2e5e2f2161e096c49aed5b9d2ed)

Making all in mca/coll/ml
make[2]: Entering directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -g -Wall 
-Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes 
-Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -module -avoid-version  -o mca_coll_ml.la -rpath 
/tmp/ompi/lib/openmpi coll_ml_module.lo coll_ml_allocation.lo 
coll_ml_barrier.lo coll_ml_bcast.lo coll_ml_component.lo coll_ml_copy_fns.lo 
coll_ml_descriptors.lo coll_ml_hier_algorithms.lo 
coll_ml_hier_algorithms_setup.lo coll_ml_hier_algorithms_bcast_setup.lo 
coll_ml_hier_algorithms_allreduce_setup.lo 
coll_ml_hier_algorithms_reduce_setup.lo coll_ml_hier_algorithms_common_setup.lo 
coll_ml_hier_algorithms_allgather_setup.lo 
coll_ml_hier_algorithm_memsync_setup.lo coll_ml_custom_utils.lo 
coll_ml_progress.lo coll_ml_reduce.lo coll_ml_allreduce.lo coll_ml_allgather.lo 
coll_ml_mca.lo coll_ml_lmngr.lo coll_ml_hier_algorithms_barrier_setup.lo 
coll_ml_select.lo coll_ml_memsync.lo coll_ml_lex.lo coll_ml_config.lo  -lrt  
-lm -lutil   -lm -lutil  
libtool: link: `coll_ml_bcast.lo' is not a valid libtool object
Makefile:1860: recipe for target 'mca_coll_ml.la' failed
make[2]: *** [mca_coll_ml.la] Error 1
make[2]: Leaving directory '/home/adrian/ompi/build/ompi/mca/coll/ml'
Makefile:3366: recipe for target 'all-recursive' failed




On Tue, Sep 08, 2015 at 05:19:56PM +, Jeff Squyres (jsquyres) wrote:
> Thanks Adrian; I turned this into https://github.com/open-mpi/ompi/issues/874.
> 
> > On Sep 8, 2015, at 9:56 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > Since a few days the MTT runs on my ppc64 systems are failing with:
> > 
> > [bimini:11716] *** Process received signal ***
> > [bimini:11716] Signal: Segmentation fault (11)
> > [bimini:11716] Signal code: Address not mapped (1)
> > [bimini:11716] Failing at address: (nil)[bimini:11716] [ 0] [0x3fffa2bb0448]
> > [bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074] 
> > [bimini:11716] [ 2]
> > /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
> >  [bimini:11716] [ 3]
> > /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
> >  [bimini:11716] [ 4]
> > /home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]
> > 
> > I think I do not see these kind of errors on any of the other MTT setups
> > so it might be ppc64 related. Just wanted to point it out.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/09/17979.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17981.php


[OMPI devel] MTT failures since the last few days on ppc64

2015-09-08 Thread Adrian Reber
For the last few days the MTT runs on my ppc64 systems have been failing with:

[bimini:11716] *** Process received signal ***
[bimini:11716] Signal: Segmentation fault (11)
[bimini:11716] Signal code: Address not mapped (1)
[bimini:11716] Failing at address: (nil)[bimini:11716] [ 0] [0x3fffa2bb0448]
[bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074] [bimini:11716] [ 
2]
/home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
 [bimini:11716] [ 3]
/home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
 [bimini:11716] [ 4]
/home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]

I think I do not see these kind of errors on any of the other MTT setups
so it might be ppc64 related. Just wanted to point it out.

Adrian


Re: [OMPI devel] esslingen MTT?

2015-08-25 Thread Adrian Reber
On Mon, Aug 24, 2015 at 09:47:22PM +, Jeff Squyres (jsquyres) wrote:
> Who runs the esslingen MTT?
> 
> You're getting some build failures on master that I don't understand:
> 
> -
> make[3]: Entering directory
> '/home/adrian/mtt-scratch/mpi-install/FDvh/src/openmpi-dev-2350-geb25c00/ompi/mpi/fortran/mpif-h/profile'
>   GENERATE psizeof_f.f90
>   FC   psizeof_f.lo
> Usage: 
> /home/adrian/mtt-scratch/mpi-install/FDvh/src/openmpi-dev-2350-geb25c00/libtool
>  [OPTION]...
> [MODE-ARG]...
> Try 'libtool --help' for more information.
> Makefile:2609: recipe for target 'psizeof_f.lo' failed
> -
> 
> Can you do a "make V=1" so that I can see what exactly is going wrong?

make[3]: Entering directory 
'/home/adrian/ompi/build/ompi/mpi/fortran/mpif-h/profile'
/bin/sh ../../../../../libtool  --tag=FC   --mode=compile  -c -o psizeof_f.lo 
 psizeof_f.f90
libtool: compile: unrecognized option `-c'
libtool: compile: Try `libtool --help' for more information.
Makefile:2598: recipe for target 'psizeof_f.lo' failed
make[3]: *** [psizeof_f.lo] Error 1

The system has no fortran compiler installed and after a

 yum install gcc-gfortran.ppc64

it builds again. So it seems a fortran compiler is now required.

Adrian


Re: [OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
I am not 100% sure I was understood correctly and I am also not sure I
understand the discussion I triggered.

Being not very familiar with the Open MPI code base, I often look at
other places in the code for examples of how something can or could be done.
Looking at different uses of OBJ_RELEASE() I see that some places first call
OBJ_RELEASE() and then set the buffer to NULL.

pcregrep -r -M  'OBJ_RELEASE.*(\n|.).*=(\s)?NULL' *

[...]
ompi/group/group_init.c:OBJ_RELEASE (new_group);
new_group = NULL;
ompi/group/group_init.c:OBJ_RELEASE (new_group);
new_group = NULL;
ompi/group/group_init.c:OBJ_RELEASE(new_group);
new_group = NULL;
ompi/group/group_init.c:OBJ_RELEASE (new_group);
new_group = NULL;
ompi/group/group_init.c:OBJ_RELEASE(new_group);
new_group = NULL;
ompi/group/group_init.c:OBJ_RELEASE(new_group);
new_group = NULL;
[... and many more ...]

That was the reason I was looking at the definition of OBJ_RELEASE(): I saw
that it already sets the buffer to NULL. Manually setting it to NULL could
theoretically lead to a situation where memory is not correctly free'd
(I have not actually seen it happen).

My question is more theoretical: is setting the buffer to NULL after
OBJ_RELEASE() unnecessary, and a bad example to follow?
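
To illustrate, a minimal sketch of the scenario I have in mind (hypothetical
code, not something taken from the tree), for an object that is still
referenced somewhere else:

    #include "opal/class/opal_object.h"

    static void sketch(void)
    {
        opal_object_t *buffer = OBJ_NEW(opal_object_t);  /* refcount == 1 */
        OBJ_RETAIN(buffer);                              /* refcount == 2 */

        OBJ_RELEASE(buffer);  /* count drops to 1: nothing is freed and the
                               * macro does not touch the pointer */
        buffer = NULL;        /* manual NULL: the still-live object is now
                               * unreachable through this pointer */

        OBJ_RELEASE(buffer);  /* this now operates on NULL (crash/assert),
                               * and the original object, still at refcount
                               * 1, is never freed */
    }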

Adrian

On Thu, Feb 12, 2015 at 12:45:06AM -0800, Ralph Castain wrote:
> It would be good to know where you are seeing this - as was stated, the macro 
> reduces the ref count and will NULL the pointer if and only if the ref count 
> goes to zero. However, the code may set it to NULL for some other reason that 
> relates to the later use of that particular variable.
> 
> If not used properly, however, it can lead to a memory leak. So it’s best 
> that we (a) identify where this was done (I personally don’t recall having 
> seen it), and (b) add comments to the code explaining why it explicitly sets 
> the param to NULL (e.g., the object is tracked elsewhere and will later be 
> free’d).
> 
> 
> > On Feb 12, 2015, at 12:09 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > I was just curious: if I am calling
> > 
> > OBJ_RELEASE(buffer);
> > buffer = NULL;
> > 
> > on a buffer with an object count different to 1, the buffer is not free'd
> > but set to NULL. If I call it again the buffer is NULL and the original
> > buffer will not be free'd. Setting the buffer to NULL seems unnecessary.
> > 
> > I have not seen this as a problem in the code I was just trying to
> > understand if I have to call only
> > 
> > OBJ_RELEASE(buffer);
> > 
> > or
> > 
> > OBJ_RELEASE(buffer);
> > buffer = NULL;
> > 
> > and for me the first variant seems to be the correct one.
> > 
> > Adrian
> > 
> > On Thu, Feb 12, 2015 at 04:58:02PM +0900, Gilles Gouaillardet wrote:
> >> Adrian,
> >> 
> >> opal_obj_update does not fail or success, it returns the new
> >> obj_reference_count.
> >> 
> >> 
> >> can you point to one specific location in the code where you think it is
> >> wrong ?
> >> 
> >> OBJ_RELEASE(buffer)
> >> buffer = NULL;
> >> 
> >> could be written as
> >> 
> >> if (((opal_object_t *)buffer)->obj_reference_count == 1) {
> >>OBJ_RELEASE(buffer);
> >> } else {
> >>buffer = NULL;
> >> }
> >> 
> >> that would never ever set buffer to NULL twice, but would be wrong
> >> since there is no atomicity here
> >> /* that was for the "unnecessary" part */
> >> 
> >> about the "wrong" part, why do you think the else branch is wrong ?
> >> /* i mean setting a pointer to NULL is not necessarily wrong */
> >> 
> >> Cheers,
> >> 
> >> Gilles
> >> 
> >> 
> >> On 2015/02/12 16:41, Adrian Reber wrote:
> >>> At many places all over the code I see
> >>> 
> >>> OBJ_RELEASE(buffer)
> >>> buffer = NULL;
> >>> 
> >>> Looking at the definition of OBJ_RELEASE() this seems unnecessary and
> >>> wrong:
> >>> 
> >>> #define OBJ_RELEASE(object) \
> >>>do {\
> >>>if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \
> >>>opal_obj_run_destructors((opal_object_t *) (object));   \
> >>>free(object);   \
> >>>object = NULL; 

Re: [OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
I was just curious: if I am calling

OBJ_RELEASE(buffer);
buffer = NULL;

on a buffer with an object count different to 1, the buffer is not free'd
but set to NULL. If I call it again the buffer is NULL and the original
buffer will not be free'd. Setting the buffer to NULL seems unnecessary.

I have not seen this as a problem in the code I was just trying to
understand if I have to call only

OBJ_RELEASE(buffer);

or

OBJ_RELEASE(buffer);
buffer = NULL;

and for me the first variant seems to be the correct one.

Adrian

On Thu, Feb 12, 2015 at 04:58:02PM +0900, Gilles Gouaillardet wrote:
> Adrian,
> 
> opal_obj_update does not fail or success, it returns the new
> obj_reference_count.
> 
> 
> can you point to one specific location in the code where you think it is
> wrong ?
> 
> OBJ_RELEASE(buffer)
> buffer = NULL;
> 
> could be written as
> 
> if (((opal_object_t *)buffer)->obj_reference_count == 1) {
> OBJ_RELEASE(buffer);
> } else {
> buffer = NULL;
> }
> 
> that would never ever set buffer to NULL twice, but would be wrong
> since there is no atomicity here
> /* that was for the "unnecessary" part */
> 
> about the "wrong" part, why do you think the else branch is wrong ?
> /* i mean setting a pointer to NULL is not necessarily wrong */
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2015/02/12 16:41, Adrian Reber wrote:
> > At many places all over the code I see
> >
> > OBJ_RELEASE(buffer)
> > buffer = NULL;
> >
> > Looking at the definition of OBJ_RELEASE() this seems unnecessary and
> > wrong:
> >
> > #define OBJ_RELEASE(object) \
> > do {\
> > if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \
> > opal_obj_run_destructors((opal_object_t *) (object));   \
> > free(object);   \
> > object = NULL;  \
> > }   \
> > } while (0)
> >
> > The object is set to NULL by the macro and only if the opal_obj_update() was
> > successful. So it seems setting the buffer manually to NULL after 
> > OBJ_RELEASE()
> > is unnecessary and if opal_obj_update() failed it also is wrong.
> >
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/02/16970.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/16971.php


[OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
At many places all over the code I see

OBJ_RELEASE(buffer)
buffer = NULL;

Looking at the definition of OBJ_RELEASE() this seems unnecessary and
wrong:

#define OBJ_RELEASE(object) \
do {\
if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \
opal_obj_run_destructors((opal_object_t *) (object));   \
free(object);   \
object = NULL;  \
}   \
} while (0)

The object is set to NULL by the macro and only if the opal_obj_update() was
successful. So it seems setting the buffer manually to NULL after OBJ_RELEASE()
is unnecessary and if opal_obj_update() failed it also is wrong.

Adrian


Re: [OMPI devel] Master hangs in opal_LIFO test

2015-02-03 Thread Adrian Reber
There is right now another bug report concerning opal_lifo and ppc64 here:

https://github.com/open-mpi/ompi/issues/371

and there were hangs on ppc64 a few weeks ago in opal_lifo which Nathan
fixed with additional barriers.

On Mon, Feb 02, 2015 at 11:18:43PM -0800, Paul Hargrove wrote:
> CORRECTION:
> 
> It is the opal_lifo (not fifo) test which hung on both systems.
> 
> -Paul
> 
> On Mon, Feb 2, 2015 at 11:03 PM, Paul Hargrove  wrote:
> 
> > I have seen opal_fifo hang on 2 distinct systems
> >  + Linux/ppc32 with xlc-11.1
> >  + Linux/x86-64 with icc-14.0.1.106
> >
> > I have no explanation to offer for either hang.
> > No "weird" configure options were passed to either.
> >
> > -Paul
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Computer Languages & Systems Software (CLaSS) Group
> > Computer Science Department   Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> >
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/16913.php


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-02-02 Thread Adrian Reber
https://github.com/open-mpi/ompi/issues/372

On Sat, Jan 31, 2015 at 01:38:54PM +, Jeff Squyres (jsquyres) wrote:
> Adrian --
> 
> Can you file this as a Github issue?  Thanks.
> 
> 
> > On Jan 17, 2015, at 12:58 PM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > This time my bug report is not PSM related:
> > 
> > I was able to reproduce the MTT error from 
> > http://mtt.open-mpi.org/index.php?do_redir=2228
> > on my system with openmpi-dev-720-gf4693c9:
> > 
> > mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 
> > 255' failed.
> > [n050409:06796] *** Process received signal ***
> > [n050409:06796] Signal: Aborted (6)
> > [n050409:06796] Signal code:  (-6)
> > [n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
> > [n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
> > [n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
> > [n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
> > [n050409:06796] [ 4] 
> > /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
> > [n050409:06796] [ 5] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
> > [n050409:06796] [ 6] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
> > [n050409:06796] [ 7] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
> > [n050409:06796] [ 8] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
> > [n050409:06796] [ 9] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
> > [n050409:06796] [10] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
> > [n050409:06796] [11] mpi_test_suite[0x464424]
> > [n050409:06796] [12] mpi_test_suite[0x470304]
> > [n050409:06796] [13] mpi_test_suite[0x444a72]
> > [n050409:06796] [14] 
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
> > [n050409:06796] [15] mpi_test_suite[0x4051a9]
> > [n050409:06796] *** End of error message ***
> > --
> > mpirun noticed that process rank 0 with PID 0 on node n050409 exited on 
> > signal 6 (Aborted).
> > --
> > 
> > Core was generated by `mpi_test_suite -t p2p'.
> > Program terminated with signal 6, Aborted.
> > (gdb) bt
> > #0  0x2b036d741635 in raise () from /lib64/libc.so.6
> > #1  0x2b036d742d9d in abort () from /lib64/libc.so.6
> > #2  0x2b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
> > #3  0x2b036d73a820 in __assert_fail () from /lib64/libc.so.6
> > #4  0x2b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, 
> > ep=0x22b66a0, order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
> > #5  0x2b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, 
> > ep=0x22b66a0, convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, 
> > header_size=14, payload_size=73000, order=255 '\377', flags=3, 
> >tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
> > #6  0x2b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
> > convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> > payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
> >descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
> > #7  0x2b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
> > datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
> > endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
> > #8  0x2b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
> > datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
> > comm=0x6939e0) at pml_ob1_isend.c:214
> > #9  0x2b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, 
> > type=0x2b97440, dest=1, tag=37, comm=0x6939e0) at psend.c:78
> > #10 0x00464424 in tst_p2p_simple_ring_xsend_run 
> > (env=0x7fff2c528530) at p2p/tst_p2p_simple_ring_xsend.c:97
> > #11 0x00470304 in tst_test_run_func (env=0x7fff2c528530) at 
> > tst_tests.c:1463
> > #12 0x00444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
> > mpi_test_suite.c:639
> > 
> > This is with --enable-debug. Without --enable-debug I get a
> > segmentation fault, but not always. Using fewer cores it work

Re: [OMPI devel] RFC: Remove embedded libltdl

2015-02-02 Thread Adrian Reber
I have reported the same error a few days ago and submitted it now as a
github issue: https://github.com/open-mpi/ompi/issues/371

On Mon, Feb 02, 2015 at 12:36:54PM +1100, Christopher Samuel wrote:
> On 31/01/15 10:51, Jeff Squyres (jsquyres) wrote:
> 
> > New tarball posted (same location).  Now featuring 100% fewer "make check" 
> > failures.
> 
> On our BG/Q front-end node (PPC64, RHEL 6.4) I see:
> 
> ../../config/test-driver: line 95: 30173 Segmentation fault  (core 
> dumped) "$@" > $log_file 2>&1
> FAIL: opal_lifo
> 
> Stack trace implies the culprit is in:
> 
> #0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
> at 
> /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
> 51  old = *addr;
> 
> I've attached a script of gdb doing "thread apply all bt full" in
> case that's helpful.
> 
> All the best,
> Chris
> -- 
>  Christopher SamuelSenior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
> 

> Script started on Mon 02 Feb 2015 12:32:56 EST
> 
> [samuel@avoca class]$ gdb 
> /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo 
> core.32444
> [?1034hGNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "ppc64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> ...
> Reading symbols from 
> /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo...done.
> [New Thread 32465]
> [New Thread 32464]
> [New Thread 32466]
> [New Thread 32444]
> [New Thread 32469]
> [New Thread 32467]
> [New Thread 32470]
> [New Thread 32463]
> [New Thread 32468]
> Missing separate debuginfo for 
> /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
> Try: yum --disablerepo='*' --enablerepo='*-debug*' install 
> /usr/lib/debug/.build-id/de/a09192aa84bbc15579ae5190dc8acd16eb94fe
> Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libpmi.so.0
> Try: yum --disablerepo='*' --enablerepo='*-debug*' install 
> /usr/lib/debug/.build-id/28/09dfc4706ed44259cc31a5898c8d1a9b76b949
> Missing separate debuginfo for /usr/local/slurm/14.03.10/lib/libslurm.so.27
> Try: yum --disablerepo='*' --enablerepo='*-debug*' install 
> /usr/lib/debug/.build-id/e2/39d8a2994ae061ab7ada0ebb7719b8efa5de96
> Missing separate debuginfo for 
> Try: yum --disablerepo='*' --enablerepo='*-debug*' install 
> /usr/lib/debug/.build-id/1a/063e3d64bb5560021ec2ba5329fb1e420b470f
> Reading symbols from 
> /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0...done.
> Loaded symbols for 
> /vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/opal/.libs/libopen-pal.so.0
> Reading symbols from /usr/local/slurm/14.03.10/lib/libpmi.so.0...done.
> Loaded symbols for /usr/local/slurm/14.03.10/lib/libpmi.so.0
> Reading symbols from /usr/local/slurm/14.03.10/lib/libslurm.so.27...done.
> Loaded symbols for /usr/local/slurm/14.03.10/lib/libslurm.so.27
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
> found)...done.
> [Thread debugging using libthread_db enabled]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib64/librt.so.1
> Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libutil.so.1...(no debugging symbols 
> found)...done.
> Loaded symbols for /lib64/libutil.so.1
> Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld64.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib64/ld64.so.1
> Core was generated by 
> `/vlsci/VLSCI/samuel/tmp/OMPI/build-gcc/test/class/.libs/lt-opal_lifo '.
> Program terminated with signal 11, Segmentation fault.
> #0  0x10001048 in opal_atomic_swap_32 (addr=0x20, newval=1)
> at 
> /vlsci/VLSCI/samuel/tmp/OMPI/openmpi-gitclone/opal/include/opal/sys/atomic_impl.h:51
> 51old = *addr;
> Missing separate debuginfos, use: debuginfo-install 
> glibc-2.12-1.107.el6_4.5.ppc64
> (gdb) thread apply all bt full
> 
> Thread 9 (Thread 0xfff7a0ef200 (LWP 32468)):
> #0  0x0080adb6629c in .__libc_write () from /lib64/libpthread.so.0
> No symbol table info available.
> #1  0x0fff7d6905b4 in show_stackframe (signo=11, 

[OMPI devel] make check failure on ppc64

2015-01-25 Thread Adrian Reber
Tonight's MTT run on ppc64 has the following failure:

http://mtt.open-mpi.org/index.php?do_redir=2229

This is not easily reproducible but I have the core file from that
segfault:

Core was generated by 
`/home/adrian/mtt-scratch/mpi-install/QMjb/src/openmpi-dev-750-gff7be58/test/cla'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x10001438 in opal_atomic_swap_32 (addr=0x38, newval=1) at 
../../opal/include/opal/sys/atomic_impl.h:51
51  old = *addr;
Missing separate debuginfos, use: debuginfo-install glibc-2.20-5.fc21.ppc64
(gdb) bt
#0  0x10001438 in opal_atomic_swap_32 (addr=0x38, newval=1) at 
../../opal/include/opal/sys/atomic_impl.h:51
#1  0x100018b8 in opal_lifo_pop_atomic (lifo=0x3fffd6ed2da0) at 
../../opal/class/opal_lifo.h:193
#2  0x10001a9c in thread_test (arg=0x3fffd6ed2da0) at opal_lifo.c:50
#3  0x3fff98a7ba64 in .start_thread () from /lib64/libpthread.so.0
#4  0x3fff989a09b4 in .__clone () from /lib64/libc.so.6

Adrian


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-20 Thread Adrian Reber
Using today's nightly snapshot (openmpi-dev-730-g06d3b57) both errors
are gone. Thanks!

On Mon, Jan 19, 2015 at 02:38:42PM +0900, Gilles Gouaillardet wrote:
> Adrian,
> 
> about the
> "[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC
> error: bad XRC API (require XRC from OFED pre 3.12). " message.
> 
> this means ompi was built on a system with OFED 3.12 or greater, and you
> are running on a system with an earlier OFED release.
> 
> please note Jeff recently pushed a patch related to that and this message
> might be a false positive.
> 
> Cheers,
> 
> Gilles
> 
> On 2015/01/19 14:17, Gilles Gouaillardet wrote:
> > Adrian,
> >
> > i just fixed this in the master
> > (https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)
> >
> > the root cause is a corner case was not handled correctly :
> >
> > MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
> >
> > type has extent = 4 *but* size = 8
> > ob1 used to test only the extent to determine whether the message should
> > be sent inlined or not
> > extent <= 256 means try to send the message inline
> > that meant a fragment of size 8 (which is greater than 65536 e.g.
> > max default size for IB) was allocated,
> > and that failed.
> >
> > now both extent and size are tested, so the message is not sent inline,
> > and it just works.
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16799.php
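
For reference, the corner case Gilles describes above boils down to a few
lines of C (an illustrative sketch using the non-deprecated
MPI_Type_create_hvector in place of MPI_Type_hvector, not the actual
reproducer from the test suite):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype type;
        MPI_Aint lb, extent;
        int size;

        MPI_Init(&argc, &argv);

        /* two MPI_INT blocks with a stride of 0 bytes: they overlap, so the
         * extent stays at one int (4 bytes here) while the size is two ints
         * (8 bytes) */
        MPI_Type_create_hvector(2, 1, 0, MPI_INT, &type);
        MPI_Type_commit(&type);

        MPI_Type_get_extent(type, &lb, &extent);
        MPI_Type_size(type, &size);
        printf("extent = %ld, size = %d\n", (long) extent, size);

        MPI_Type_free(&type);
        MPI_Finalize();
        return 0;
    }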


[OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-17 Thread Adrian Reber
This time my bug report is not PSM related:

I was able to reproduce the MTT error from 
http://mtt.open-mpi.org/index.php?do_redir=2228
on my system with openmpi-dev-720-gf4693c9:

mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' 
failed.
[n050409:06796] *** Process received signal ***
[n050409:06796] Signal: Aborted (6)
[n050409:06796] Signal code:  (-6)
[n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
[n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
[n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
[n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
[n050409:06796] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
[n050409:06796] [ 5] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
[n050409:06796] [ 6] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
[n050409:06796] [ 7] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
[n050409:06796] [ 8] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
[n050409:06796] [ 9] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
[n050409:06796] [10] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
[n050409:06796] [11] mpi_test_suite[0x464424]
[n050409:06796] [12] mpi_test_suite[0x470304]
[n050409:06796] [13] mpi_test_suite[0x444a72]
[n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
[n050409:06796] [15] mpi_test_suite[0x4051a9]
[n050409:06796] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 0 on node n050409 exited on signal 
6 (Aborted).
--

Core was generated by `mpi_test_suite -t p2p'.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x2b036d741635 in raise () from /lib64/libc.so.6
#1  0x2b036d742d9d in abort () from /lib64/libc.so.6
#2  0x2b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
#3  0x2b036d73a820 in __assert_fail () from /lib64/libc.so.6
#4  0x2b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0, 
order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
#5  0x2b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0, 
convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
payload_size=73000, order=255 '\377', flags=3, 
tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
#6  0x2b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
#7  0x2b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
#8  0x2b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
comm=0x6939e0) at pml_ob1_isend.c:214
#9  0x2b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440, 
dest=1, tag=37, comm=0x6939e0) at psend.c:78
#10 0x00464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530) at 
p2p/tst_p2p_simple_ring_xsend.c:97
#11 0x00470304 in tst_test_run_func (env=0x7fff2c528530) at 
tst_tests.c:1463
#12 0x00444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
mpi_test_suite.c:639

This is with --enable-debug. Without --enable-debug I get a
segmentation fault, but not always. Using fewer cores it works most
of the time. With 32 cores on 4 nodes it happens almost
all the time. If it does not crash using fewer cores I get messages like:

[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC 
error: bad XRC API (require XRC from OFED pre 3.12).

Adrian


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-16 Thread Adrian Reber
See my comment on https://github.com/open-mpi/ompi/issues/347

On Thu, Jan 15, 2015 at 05:01:00PM -0500, George Bosilca wrote:
> Skimming through the PSM code shows that the return values of the PSM
> functions are handled in most cases. Thus, removing the default error
> handler might not be such a bad idea.
> 
> Did you experience any trouble running with the version without the default
> error handler registered?
> 
>   George.
> 
> 
> On Thu, Jan 15, 2015 at 4:40 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > It even says so in the code:
> >
> > ompi/mca/mtl/psm/mtl_psm.c:
> >
> >/* Default error handling is enabled, errors will not be returned to
> >  * user.  PSM prints the error and the offending endpoint's
> > hostname
> >  * and exits with -1 */
> >
> > Disabling the default PSM error handler makes MPI_Cancel() fail
> > gracefully. But then no error is handled anymore.
> >
> > Adrian
> >
> > On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote:
> > > As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
> > > does not work. The error is the same as before.
> > >
> > > Looking at your patch I would also expect that this is the correct fix
> > > and I even tried to change ompi_mtl_psm_cancel() to always return
> > > OMPI_SUCCESS. MPI_Cancel() still fails.
> > >
> > > Looking at the PSM code it seems it can directly call exit(-1) and thus
> > > terminating and never returning to Open MPI. I do not see any debug
> > > output from Open MPI after "Cannot cancel send requests" from PSM.
> > >
> > >   Adrian
> > >
> > > On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> > > > From the MPI standard perspective MPI_Cancel doesn't have to succeed,
> > it
> > > > can also gracefully fail. However, the PSM MTL diverges from the MPI
> > > > standard and if a request cannot be canceled an error is returned.
> > Here is
> > > > a patch to fix this issue.
> > > >
> > > > diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > index 6da3386..277c761 100644
> > > > --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > > > @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct
> > mca_mtl_base_module_t*
> > > > mtl,
> > > >  if(PSM_OK == err) {
> > > >    mtl_request->ompi_req->req_status._cancelled = true;
> > > >
> > mtl_psm_request->super.completion_callback(_psm_request->super);
> > > > -  return OMPI_SUCCESS;
> > > > -} else {
> > > > -  return OMPI_ERROR;
> > > >  }
> > > > +return OMPI_SUCCESS;
> > > >} else if(PSM_MQ_INCOMPLETE == err) {
> > > >  return OMPI_SUCCESS;
> > > >}
> > > >
> > > >   George.
> > > >
> > > >
> > > > On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> > > >
> > > > > Doing
> > > > >
> > > > > MPI_Isend()
> > > > >
> > > > > followed by a
> > > > >
> > > > > MPI_Cancel()
> > > > >
> > > > > fails on my PSM based system with 1.8.4 like this:
> > > > >
> > > > > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > > > > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > > > > ---
> > > > > Primary job  terminated normally, but 1 process returned
> > > > > a non-zero exit code.. Per user-direction, the job has been aborted.
> > > > > ---
> > > > >
> > --
> > > > > mpirun detected that one or more processes exited with non-zero
> > status,
> > > > > thus causing
> > > > > the job to be terminated. The first process to do so was:
> > > > >
> > > > >   Process name: [[58364,1],1]
> > > > >   Exit code:255
> > > > >
> > --
> > > > >
> > > > > Is this something PSM actually cannot do or an Open MPI error?
> > > > >
> > > > > Adrian
> > > > > ___
> > > > > devel mailing list
> > > > > de...@open-mpi.org
> > > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > > Link to this post:
> > > > > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> > > > >
> > >
> > > > ___
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16784.php
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16786.php


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
It even says so in the code:

ompi/mca/mtl/psm/mtl_psm.c:

   /* Default error handling is enabled, errors will not be returned to
 * user.  PSM prints the error and the offending endpoint's hostname
 * and exits with -1 */

Disabling the default PSM error handler makes MPI_Cancel() fail
gracefully. But then no error is handled anymore.

Adrian

On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote:
> As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
> does not work. The error is the same as before.
> 
> Looking at your patch I would also expect that this is the correct fix
> and I even tried to change ompi_mtl_psm_cancel() to always return
> OMPI_SUCCESS. MPI_Cancel() still fails.
> 
> Looking at the PSM code it seems it can directly call exit(-1) and thus
> terminating and never returning to Open MPI. I do not see any debug
> output from Open MPI after "Cannot cancel send requests" from PSM.
> 
>   Adrian
> 
> On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> > From the MPI standard perspective MPI_Cancel doesn't have to succeed, it
> > can also gracefully fail. However, the PSM MTL diverges from the MPI
> > standard and if a request cannot be canceled an error is returned. Here is
> > a patch to fix this issue.
> > 
> > diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > index 6da3386..277c761 100644
> > --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t*
> > mtl,
> >  if(PSM_OK == err) {
> >mtl_request->ompi_req->req_status._cancelled = true;
> >  mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> > -  return OMPI_SUCCESS;
> > -} else {
> > -  return OMPI_ERROR;
> >  }
> > +return OMPI_SUCCESS;
> >} else if(PSM_MQ_INCOMPLETE == err) {
> >  return OMPI_SUCCESS;
> >}
> > 
> >   George.
> > 
> > 
> > On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > > Doing
> > >
> > > MPI_Isend()
> > >
> > > followed by a
> > >
> > > MPI_Cancel()
> > >
> > > fails on my PSM based system with 1.8.4 like this:
> > >
> > > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > > ---
> > > Primary job  terminated normally, but 1 process returned
> > > a non-zero exit code.. Per user-direction, the job has been aborted.
> > > ---
> > > --
> > > mpirun detected that one or more processes exited with non-zero status,
> > > thus causing
> > > the job to be terminated. The first process to do so was:
> > >
> > >   Process name: [[58364,1],1]
> > >   Exit code:255
> > > --
> > >
> > > Is this something PSM actually cannot do or an Open MPI error?
> > >
> > > Adrian
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> > >
> 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/01/16784.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16786.php

Adrian

-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
C-3PO: 
Don't call me a mindless philosopher, you overweight
glob of grease!


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
does not work. The error is the same as before.

Looking at your patch I would also expect that this is the correct fix
and I even tried to change ompi_mtl_psm_cancel() to always return
OMPI_SUCCESS. MPI_Cancel() still fails.

Looking at the PSM code it seems it can directly call exit(-1) and thus
terminating and never returning to Open MPI. I do not see any debug
output from Open MPI after "Cannot cancel send requests" from PSM.

Adrian

On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> From the MPI standard perspective MPI_Cancel doesn't have to succeed, it
> can also gracefully fail. However, the PSM MTL diverges from the MPI
> standard and if a request cannot be canceled an error is returned. Here is
> a patch to fix this issue.
> 
> diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> index 6da3386..277c761 100644
> --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t*
> mtl,
>  if(PSM_OK == err) {
>mtl_request->ompi_req->req_status._cancelled = true;
>  mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> -  return OMPI_SUCCESS;
> -} else {
> -  return OMPI_ERROR;
>  }
> +return OMPI_SUCCESS;
>} else if(PSM_MQ_INCOMPLETE == err) {
>  return OMPI_SUCCESS;
>}
> 
>   George.
> 
> 
> On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Doing
> >
> > MPI_Isend()
> >
> > followed by a
> >
> > MPI_Cancel()
> >
> > fails on my PSM based system with 1.8.4 like this:
> >
> > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > ---
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > ---
> > --
> > mpirun detected that one or more processes exited with non-zero status,
> > thus causing
> > the job to be terminated. The first process to do so was:
> >
> >   Process name: [[58364,1],1]
> >   Exit code:255
> > --
> >
> > Is this something PSM actually cannot do or an Open MPI error?
> >
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> >

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16784.php


[OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
Doing 

MPI_Isend()

followed by a

MPI_Cancel()

fails on my PSM based system with 1.8.4 like this:

n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58364,1],1]
  Exit code:255
--

Is this something PSM actually cannot do or an Open MPI error?
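
A minimal, self-contained sketch along these lines (not the actual test case;
it sends to itself so it can always finish cleanly if the cancel is refused):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, flag, sbuf = 42, rbuf;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* post a send and immediately try to cancel it */
        MPI_Isend(&sbuf, 1, MPI_INT, rank, 99, MPI_COMM_WORLD, &req);
        MPI_Cancel(&req);
        MPI_Wait(&req, &status);
        MPI_Test_cancelled(&status, &flag);
        printf("rank %d: send %s cancelled\n", rank, flag ? "was" : "was not");

        if (!flag) {
            /* the cancel did not take effect: drain the self-message so
             * that MPI_Finalize can complete cleanly */
            MPI_Recv(&rbuf, 1, MPI_INT, rank, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

On the PSM setup described above the failure happens inside MPI_Cancel()
itself: PSM's default error handler prints the "Cannot cancel send requests"
message and exits, so the program never gets to report whether the cancel
took effect.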

Adrian


Re: [OMPI devel] Changed behaviour with PSM on master

2015-01-09 Thread Adrian Reber
Should I still open a ticket? Will these be changed or do I always have
to provide '--mca mtl psm' in the future?

On Fri, Jan 09, 2015 at 12:27:59PM -0700, Howard Pritchard wrote:
> HI Adrian, Andrew,
> 
> Sorry, try again: both the libfabric psm provider and the open mpi psm
> mtl are trying to use psm_init.
> 
> So, to avoid this problem, add
> 
> --mca mtl psm
> 
> to your mpirun command line.
> 
> Sorry for the confusion.
> 
> Howard
> 
> 
> 2015-01-09 7:52 GMT-07:00 Friedley, Andrew <andrew.fried...@intel.com>:
> 
> > No this is not expected behavior.
> >
> > The PSM MTL code has not changed in 2 months, when I fixed that unused
> > variable warning for you.  That suggests something above the PSM MTL broke
> > things.  I see no reason your older software install should suddenly
> > stopping working if all you are updating is OMPI master -- at least with
> > respect to PSM anyway.
> >
> > The error message is right, it's not possible to open more than one
> > context per process.  This hasn't changed.  It does indicate that maybe
> > something is causing the MTL to be opened twice in each process?
> >
> > Andrew
> >
> > > -Original Message-
> > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > > Reber
> > > Sent: Friday, January 9, 2015 4:13 AM
> > > To: de...@open-mpi.org
> > > Subject: [OMPI devel] Changed behaviour with PSM on master
> > >
> > > Running the mpi_test_suite on master used to work with no problems. At
> > > some point in time it stopped working however and now I get only error
> > > messages from PSM:
> > >
> > > """
> > > n050301:3.0.In PSM version 1.14, it is not possible to open more than
> > one
> > > context per process
> > >
> > > [n050301:26526] Open MPI detected an unexpected PSM error in opening an
> > > endpoint: In PSM version 1.14, it is not possible to open more than one
> > > context per process """
> > >
> > > I know that I do not have the newest version of the PSM library and that
> > I
> > > need to update the library but as this requires many software packages
> > to be
> > > re-compiled we are trying to avoid it on our CentOS6 based system.
> > >
> > > My main question (probably for Andrew) is if this is an expected
> > behaviour
> > > on master. It works on 1.8.x and it used to work on master at least
> > until 2014-
> > > 12-08.
> > >
> > > This is the last MTT entry for working PSM (with my older version)
> > > http://mtt.open-mpi.org/index.php?do_redir=2226
> > >
> > > and since a few days it fails on master
> > > http://mtt.open-mpi.org/index.php?do_redir=2225
> > >
> > > On another system (RHEL7) with newer PSM libraries there is no such
> > error.
> > >
> > >   Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16766.php
> >

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16769.php


Re: [OMPI devel] test/class/opal_fifo failure on ppc64

2015-01-09 Thread Adrian Reber
Thanks. mtt on my ppc64 system is happy again.

On Thu, Jan 08, 2015 at 09:16:43AM -0700, Nathan Hjelm wrote:
> 
> Fixed on master. I forgot a write memory barrier in the 64-bit version
> of opal_fifo_pop_atomic.
> 
> -Nathan
> 
> On Thu, Jan 08, 2015 at 02:29:05PM +0100, Adrian Reber wrote:
> > I am trying to build OMPI git master on ppc64 (PPC970MP) and
> > test/class/opal_fifo fails during make check most of the time.
> > 
> > [adrian@bimini class]$ ./opal_fifo
> > Single thread test. Time: 0 s 99714 us 99 nsec/poppush
> > Atomics thread finished. Time: 0 s 347577 us 347 nsec/poppush
> > Atomics thread finished. Time: 11 s 490743 us 11490 nsec/poppush
> > Atomics thread finished. Time: 11 s 567542 us 11567 nsec/poppush
> > Atomics thread finished. Time: 11 s 655924 us 11655 nsec/poppush
> > Atomics thread finished. Time: 11 s 786925 us 11786 nsec/poppush
> > Atomics thread finished. Time: 11 s 931230 us 11931 nsec/poppush
> > Atomics thread finished. Time: 12 s 11617 us 12011 nsec/poppush
> > Atomics thread finished. Time: 12 s 63224 us 12063 nsec/poppush
> > Atomics thread finished. Time: 12 s 65844 us 12065 nsec/poppush
> >  Failure :  fifo push/pop multi-threaded with atomics
> > All threads finished. Thread count: 8 Time: 12 s 66103 us 1508 nsec/poppush
> > Exhaustive atomics thread finished. Popped 11982 items. Time: 3 s 700703 us 
> > 308855 nsec/poppush
> > Exhaustive atomics thread finished. Popped 12171 items. Time: 3 s 759974 us 
> > 308928 nsec/poppush
> > Exhaustive atomics thread finished. Popped 11593 items. Time: 3 s 787227 us 
> > 326682 nsec/poppush
> > Exhaustive atomics thread finished. Popped 11079 items. Time: 3 s 786468 us 
> > 341769 nsec/poppush
> > Exhaustive atomics thread finished. Popped 16467 items. Time: 4 s 7891 us 
> > 243389 nsec/poppush
> > Exhaustive atomics thread finished. Popped 11097 items. Time: 4 s 68897 us 
> > 36 nsec/poppush
> > Exhaustive atomics thread finished. Popped 25583 items. Time: 4 s 89074 us 
> > 159835 nsec/poppush
> > Exhaustive atomics thread finished. Popped 22092 items. Time: 4 s 82373 us 
> > 184789 nsec/poppush
> >  Failure :  fifo push/pop multi-threaded with atomics when there are 
> > insufficient items
> > All threads finished. Thread count: 8 Time: 4 s 93369 us 511 nsec/poppush
> >  Failure :  fifo pop all items
> > SUPPORT: OMPI Test failed: opal_fifo_t (3 of 8 failed)
> > 
> > I had a look at the memory barriers in 
> > opal/include/opal/sys/powerpc/atomic.h
> > and from what little I remember about PPC64 those look correct:
> > 
> > #define MB()  __asm__ __volatile__ ("sync" : : : "memory")
> > #define RMB() __asm__ __volatile__ ("lwsync" : : : "memory")
> > #define WMB() __asm__ __volatile__ ("eieio" : : : "memory")
> > 
> > The system is running Fedora 21 with gcc 4.9.2 and if this platform
> > is still relevant I can provide SSH access to the machine
> > for further debugging.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/01/16760.php



> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16762.php




[OMPI devel] test/class/opal_fifo failure on ppc64

2015-01-08 Thread Adrian Reber
I am trying to build OMPI git master on ppc64 (PPC970MP) and
test/class/opal_fifo fails during make check most of the time.

[adrian@bimini class]$ ./opal_fifo
Single thread test. Time: 0 s 99714 us 99 nsec/poppush
Atomics thread finished. Time: 0 s 347577 us 347 nsec/poppush
Atomics thread finished. Time: 11 s 490743 us 11490 nsec/poppush
Atomics thread finished. Time: 11 s 567542 us 11567 nsec/poppush
Atomics thread finished. Time: 11 s 655924 us 11655 nsec/poppush
Atomics thread finished. Time: 11 s 786925 us 11786 nsec/poppush
Atomics thread finished. Time: 11 s 931230 us 11931 nsec/poppush
Atomics thread finished. Time: 12 s 11617 us 12011 nsec/poppush
Atomics thread finished. Time: 12 s 63224 us 12063 nsec/poppush
Atomics thread finished. Time: 12 s 65844 us 12065 nsec/poppush
 Failure :  fifo push/pop multi-threaded with atomics
All threads finished. Thread count: 8 Time: 12 s 66103 us 1508 nsec/poppush
Exhaustive atomics thread finished. Popped 11982 items. Time: 3 s 700703 us 
308855 nsec/poppush
Exhaustive atomics thread finished. Popped 12171 items. Time: 3 s 759974 us 
308928 nsec/poppush
Exhaustive atomics thread finished. Popped 11593 items. Time: 3 s 787227 us 
326682 nsec/poppush
Exhaustive atomics thread finished. Popped 11079 items. Time: 3 s 786468 us 
341769 nsec/poppush
Exhaustive atomics thread finished. Popped 16467 items. Time: 4 s 7891 us 
243389 nsec/poppush
Exhaustive atomics thread finished. Popped 11097 items. Time: 4 s 68897 us 
36 nsec/poppush
Exhaustive atomics thread finished. Popped 25583 items. Time: 4 s 89074 us 
159835 nsec/poppush
Exhaustive atomics thread finished. Popped 22092 items. Time: 4 s 82373 us 
184789 nsec/poppush
 Failure :  fifo push/pop multi-threaded with atomics when there are 
insufficient items
All threads finished. Thread count: 8 Time: 4 s 93369 us 511 nsec/poppush
 Failure :  fifo pop all items
SUPPORT: OMPI Test failed: opal_fifo_t (3 of 8 failed)

I had a look at the memory barriers in opal/include/opal/sys/powerpc/atomic.h
and from what little I remember about PPC64 those look correct:

#define MB()  __asm__ __volatile__ ("sync" : : : "memory")
#define RMB() __asm__ __volatile__ ("lwsync" : : : "memory")
#define WMB() __asm__ __volatile__ ("eieio" : : : "memory")
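
For anyone less familiar with weakly ordered CPUs, here is a minimal sketch
(hypothetical types, not the actual opal_fifo structures) of the kind of
ordering these barriers have to enforce in a lock-free push/pop:

#include "opal/sys/atomic.h"   /* for opal_atomic_wmb() */

typedef struct item { struct item *next; int payload; } item_t;

static void publish(item_t * volatile *tail, item_t *item)
{
    item->next    = NULL;   /* fill in the item first ...                     */
    item->payload = 42;
    opal_atomic_wmb();      /* ... force those stores out (the "eieio" above) */
    *tail = item;           /* ... and only then make the item visible        */
}

Without the write barrier before the final store, another core can observe the
new pointer while the item's fields are still in flight, which matches the
intermittent pop failures shown above.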

The system is running Fedora 21 with gcc 4.9.2 and if this platform
is still relevant I can provide SSH access to the machine
for further debugging.

Adrian


[OMPI devel] FT code (again)

2014-12-19 Thread Adrian Reber
Again I am trying to get the FT code working. This time I am unsure how
to resolve the code changes from this commit:

commit aec5cd08bd8c33677276612b899b48618d271efa
Author: Ralph Castain 
List-Post: devel@lists.open-mpi.org
Date:   Thu Aug 21 18:56:47 2014 +

Per the PMIx RFC:


This includes changes like this:


@@ -172,17 +164,7 @@ static int rte_init(void)
  * in the job won't be executing this step, so we would hang
  */
 if (ORTE_PROC_IS_NON_MPI && !orte_do_not_barrier) {
-orte_grpcomm_collective_t coll;
-OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
-coll.id = orte_process_info.peer_modex;
-coll.active = true;
-if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(&coll))) {
-ORTE_ERROR_LOG(ret);
-error = "orte modex";
-goto error;
-}
-ORTE_WAIT_FOR_COMPLETION(coll.active);
-OBJ_DESTRUCT(&coll);
+opal_pmix.fence(NULL, 0);
 }


In the FT code in orte/mca/ess/env/ess_env_module.c there is similar code:

OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
coll.id = orte_process_info.snapc_init_barrier;

...

if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(&coll))) {

...

coll.active = true;
ORTE_WAIT_FOR_COMPLETION(coll.active);


How can this be expressed with the new code?
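
A minimal sketch of what I would guess the replacement looks like, assuming the
snapc barrier can simply become a blocking fence across all processes (the same
transformation the commit applies to the modex above), still under the usual FT
guard:

#if OPAL_ENABLE_FT_CR == 1
    /* sketch only: the snapc_init_barrier grpcomm collective becomes
     * a fence over all procs, as was done for the modex */
    opal_pmix.fence(NULL, 0);
#endif

Whether the blocking fence also covers what ORTE_WAIT_FOR_COMPLETION(coll.active)
was waiting for is part of the question.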


Adrian


Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Adrian Reber
1.8.4rc4 works without errors on my PSM based systems.

Adrian

On Sat, Dec 13, 2014 at 03:06:07PM -0800, Ralph Castain wrote:
> Hi folks
> 
> I’ve rolled up the bug fixes so far, including the thread-multiple 
> performance fix. So please give this one a whirl
> 
> http://www.open-mpi.org/software/ompi/v1.8/ 
> 
> 
> Ralph
> 

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16586.php


Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-13 Thread Adrian Reber
I applied the fix committed on master and described in

https://github.com/open-mpi/ompi/issues/268

on 1.8.3 and 1.8.4rc1 and this seems to have fixed my problems. I can
include my PSM based mtt results in the main mtt database if desired.

Adrian


On Tue, Nov 11, 2014 at 07:42:24PM +0100, Adrian Reber wrote:
> Using the intel test suite I can reproduce it for example with:
> 
> $ mpirun --np 2 --map-by ppr:1:node   `pwd`/src/MPI_Allgatherv_c
> MPITEST info  (0): Starting MPI_Allgatherv() test
> MPITEST info  (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
> MPITEST info  (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
> MPITEST info  (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
> 
> MPI_Allgatherv_c:9230 terminated with signal 11 at PC=7fc4ced4b150 
> SP=7fff45aa2fb0.  Backtrace:
> /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7fc4ced4b150]
> /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7fc4ced4219a]
> /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7fc4ced3a727]
> /opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7fc4cf902303]
> /opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7fc4cf7cbc2a]
> /opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7fc4cf7fb602]
> /lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40f5bf]
> /lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40edf4]
> /lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401c80]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc4cf1a8af5]
> /lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401a89]
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> 
> 
> On Tue, Nov 11, 2014 at 10:26:52AM -0800, Ralph Castain wrote:
> > I think it would help understand this if you isolated it down to a single 
> > test that is failing, rather than just citing an entire test suite. For 
> > example, we know that the many-to-one test is never going to pass, 
> > regardless of transport. We also know that the dynamic tests will fail with 
> > PSM as they are not supported by that transport.
> > 
> > So could you find one test that doesn’t pass, and give us some info on that 
> > one?
> > 
> > 
> > > On Nov 11, 2014, at 10:04 AM, Adrian Reber <adr...@lisas.de> wrote:
> > > 
> > > Some more information about our PSM troubles.
> > > 
> > > Using 1.6.5 the test suite still works. It fails with 1.8.3 and
> > > 1.8.4rc1. As long as all processes are running on one node it also
> > > works. As soon as one process is running on a second node it fails with
> > > the previously described errors. I also tried the 1.8 release and it has
> > > the same error. Another way to trigger it with only two processes is:
> > > 
> > > mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"
> > > 
> > > Some change introduced between 1.6.5 and 1.8 broke this test case with
> > > PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
> > > Open MPI related than PSM.
> > > 
> > > Intel MPI (4.1.1) has also no troubles running the test cases.
> > > 
> > >   Adrian
> > > 
> > > On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> > >> Hi Adrian,
> > >> 
> > >> Yes, I suggest trying either RH support or Intel's support at  
> > >> ibsupp...@intel.com.  They might have seen this problem before.  Since 
> > >> you're running the RHEL versions of PSM and related software, one thing 
> > >> you could try is IFS.  I think I was running IFS 7.3.0, so that's a 
> > >> difference between your setup and mine.  At the least, it may help 
> > >> support nail down the issue.
> > >> 
> > >> Andrew
> > >> 
> > >>> -Original Message-
> > >>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > >>> Reber
> > >>> Sent: Monday, November 10, 2014 12:39 PM
> > >>> To: Open MPI Developers
> > >>> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > >>> 
> > >>> Andrew,
> > >>> 
> > >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
Using the intel test suite I can reproduce it for example with:

$ mpirun --np 2 --map-by ppr:1:node   `pwd`/src/MPI_Allgatherv_c
MPITEST info  (0): Starting MPI_Allgatherv() test
MPITEST info  (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1

MPI_Allgatherv_c:9230 terminated with signal 11 at PC=7fc4ced4b150 
SP=7fff45aa2fb0.  Backtrace:
/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7fc4ced4b150]
/lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7fc4ced4219a]
/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7fc4ced3a727]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7fc4cf902303]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7fc4cf7cbc2a]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7fc4cf7fb602]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40f5bf]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40edf4]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401c80]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc4cf1a8af5]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401a89]
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---


On Tue, Nov 11, 2014 at 10:26:52AM -0800, Ralph Castain wrote:
> I think it would help understand this if you isolated it down to a single 
> test that is failing, rather than just citing an entire test suite. For 
> example, we know that the many-to-one test is never going to pass, regardless 
> of transport. We also know that the dynamic tests will fail with PSM as they 
> are not supported by that transport.
> 
> So could you find one test that doesn’t pass, and give us some info on that 
> one?
> 
> 
> > On Nov 11, 2014, at 10:04 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > Some more information about our PSM troubles.
> > 
> > Using 1.6.5 the test suite still works. It fails with 1.8.3 and
> > 1.8.4rc1. As long as all processes are running on one node it also
> > works. As soon as one process is running on a second node it fails with
> > the previously described errors. I also tried the 1.8 release and it has
> > the same error. Another way to trigger it with only two processes is:
> > 
> > mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"
> > 
> > Some change introduced between 1.6.5 and 1.8 broke this test case with
> > PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
> > Open MPI related than PSM.
> > 
> > Intel MPI (4.1.1) has also no troubles running the test cases.
> > 
> > Adrian
> > 
> > On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> >> Hi Adrian,
> >> 
> >> Yes, I suggest trying either RH support or Intel's support at  
> >> ibsupp...@intel.com.  They might have seen this problem before.  Since 
> >> you're running the RHEL versions of PSM and related software, one thing 
> >> you could try is IFS.  I think I was running IFS 7.3.0, so that's a 
> >> difference between your setup and mine.  At the least, it may help support 
> >> nail down the issue.
> >> 
> >> Andrew
> >> 
> >>> -Original Message-
> >>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> >>> Reber
> >>> Sent: Monday, November 10, 2014 12:39 PM
> >>> To: Open MPI Developers
> >>> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> >>> 
> >>> Andrew,
> >>> 
> >>> thanks for looking into this. I was able to reproduce this error on RHEL 
> >>> 7 with
> >>> PSM provided by RHEL:
> >>> 
> >>> infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
> >>> infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
> >>> 
> >>> $ mpirun -np 32 mpi_test_suite -t "environment"
> >>> 
> >>> mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
> >>> SP=7fff9e0ce770.  Backtrace:
> >>> /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
> >>> /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
> >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
Some more information about our PSM troubles.

Using 1.6.5 the test suite still works. It fails with 1.8.3 and
1.8.4rc1. As long as all processes are running on one node it also
works. As soon as one process is running on a second node it fails with
the previously described errors. I also tried the 1.8 release and it has
the same error. Another way to trigger it with only two processes is:

mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"

Some change introduced between 1.6.5 and 1.8 broke this test case with
PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
Open MPI related than PSM.

Intel MPI (4.1.1) has also no troubles running the test cases.

Adrian

On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> Hi Adrian,
> 
> Yes, I suggest trying either RH support or Intel's support at  
> ibsupp...@intel.com.  They might have seen this problem before.  Since you're 
> running the RHEL versions of PSM and related software, one thing you could 
> try is IFS.  I think I was running IFS 7.3.0, so that's a difference between 
> your setup and mine.  At the least, it may help support nail down the issue.
> 
> Andrew
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, November 10, 2014 12:39 PM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > 
> > Andrew,
> > 
> > thanks for looking into this. I was able to reproduce this error on RHEL 7 
> > with
> > PSM provided by RHEL:
> > 
> > infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > 
> > $ mpirun -np 32 mpi_test_suite -t "environment"
> > 
> > mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
> > SP=7fff9e0ce770.  Backtrace:
> > /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
> > /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
> > /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
> > mpi_test_suite[0x469420]
> > mpi_test_suite[0x441d8e]
> > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
> > mpi_test_suite[0x405349]
> > 
> > Source RPM  : infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.src.rpm
> > Build Date  : Tue 04 Mar 2014 02:45:41 AM CET Build Host  : x86-
> > 025.build.eng.bos.redhat.com Relocations : /usr
> > Packager: Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
> > Vendor  : Red Hat, Inc.
> > URL : 
> > http://www.openfabrics.org/downloads/infinipath-psm/infinipath-
> > psm-3.2-2_ga8c3e3e_open.tar.gz
> > Summary : QLogic PSM Libraries
> > 
> > Is this supposed to work? Or is this something Red Hat has to fix?
> > 
> > Adrian
> > 
> > On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
> > > Hi Adrian,
> > >
> > > I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this
> > with one 8-core node):
> > >
> > > $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t
> > "environment"
> > > (Rank:0) tst_test_array[0]:Status
> > > (Rank:0) tst_test_array[1]:Request_Null
> > > (Rank:0) tst_test_array[2]:Type_dup
> > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0
> > >
> > > Works with various np from 8 to 32.  Your original case:
> > >
> > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> > >
> > > Runs for a while and eventually hits send cancellation errors.
> > >
> > > Any chance you could try updating your infinipath libraries?
> > >
> > > Andrew
> > >
> > > > -Original Message-
> > > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > > > Reber
> > > > Sent: Monday, October 27, 2014 9:11 AM
> > > > To: Open MPI Developers
> > > > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > > >
> > > > This is a simpler test setup:
> > > >
> > > > On 8 core machines this works:
> > > >
> > >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
What is IFS?

On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> Hi Adrian,
> 
> Yes, I suggest trying either RH support or Intel's support at  
> ibsupp...@intel.com.  They might have seen this problem before.  Since you're 
> running the RHEL versions of PSM and related software, one thing you could 
> try is IFS.  I think I was running IFS 7.3.0, so that's a difference between 
> your setup and mine.  At the least, it may help support nail down the issue.
> 
> Andrew
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, November 10, 2014 12:39 PM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > 
> > Andrew,
> > 
> > thanks for looking into this. I was able to reproduce this error on RHEL 7 
> > with
> > PSM provided by RHEL:
> > 
> > infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > 
> > $ mpirun -np 32 mpi_test_suite -t "environment"
> > 
> > mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
> > SP=7fff9e0ce770.  Backtrace:
> > /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
> > /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
> > /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
> > mpi_test_suite[0x469420]
> > mpi_test_suite[0x441d8e]
> > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
> > mpi_test_suite[0x405349]
> > 
> > Source RPM  : infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.src.rpm
> > Build Date  : Tue 04 Mar 2014 02:45:41 AM CET Build Host  : x86-
> > 025.build.eng.bos.redhat.com Relocations : /usr
> > Packager: Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
> > Vendor  : Red Hat, Inc.
> > URL : 
> > http://www.openfabrics.org/downloads/infinipath-psm/infinipath-
> > psm-3.2-2_ga8c3e3e_open.tar.gz
> > Summary : QLogic PSM Libraries
> > 
> > Is this supposed to work? Or is this something Red Hat has to fix?
> > 
> > Adrian
> > 
> > On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
> > > Hi Adrian,
> > >
> > > I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this
> > with one 8-core node):
> > >
> > > $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t
> > "environment"
> > > (Rank:0) tst_test_array[0]:Status
> > > (Rank:0) tst_test_array[1]:Request_Null
> > > (Rank:0) tst_test_array[2]:Type_dup
> > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0
> > >
> > > Works with various np from 8 to 32.  Your original case:
> > >
> > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> > >
> > > Runs for a while and eventually hits send cancellation errors.
> > >
> > > Any chance you could try updating your infinipath libraries?
> > >
> > > Andrew
> > >
> > > > -Original Message-
> > > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > > > Reber
> > > > Sent: Monday, October 27, 2014 9:11 AM
> > > > To: Open MPI Developers
> > > > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > > >
> > > > This is a simpler test setup:
> > > >
> > > > On 8 core machines this works:
> > > >
> > > > $ mpirun  -np 8  mpi_test_suite -t "environment"
> > > > [...]
> > > > Number of failed tests:0
> > > >
> > > > Using 9 or more cores it fails:
> > > >
> > > > $ mpirun  -np 9  mpi_test_suite -t "environment"
> > > >
> > > > mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4
> > > > SP=7fff06431a70.  Backtrace:
> > > > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107
> > > > fa9a
> > > > 4]
> > > >
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107e
> >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
Andrew,

thanks for looking into this. I was able to reproduce this error on RHEL 7
with PSM provided by RHEL:

infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64

$ mpirun -np 32 mpi_test_suite -t "environment" 

mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
SP=7fff9e0ce770.  Backtrace:
/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
/lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
mpi_test_suite[0x469420]
mpi_test_suite[0x441d8e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
mpi_test_suite[0x405349]

Source RPM  : infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.src.rpm
Build Date  : Tue 04 Mar 2014 02:45:41 AM CET
Build Host  : x86-025.build.eng.bos.redhat.com
Relocations : /usr 
Packager: Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor  : Red Hat, Inc.
URL : 
http://www.openfabrics.org/downloads/infinipath-psm/infinipath-psm-3.2-2_ga8c3e3e_open.tar.gz
Summary : QLogic PSM Libraries

Is this supposed to work? Or is this something Red Hat has to fix?

Adrian

On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
> Hi Adrian,
> 
> I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this 
> with one 8-core node):
> 
> $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t "environment"
> (Rank:0) tst_test_array[0]:Status
> (Rank:0) tst_test_array[1]:Request_Null
> (Rank:0) tst_test_array[2]:Type_dup
> (Rank:0) tst_test_array[3]:Get_version
> Number of failed tests:0
> 
> Works with various np from 8 to 32.  Your original case:
> 
> $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> 
> Runs for a while and eventually hits send cancellation errors.
> 
> Any chance you could try updating your infinipath libraries?
> 
> Andrew
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, October 27, 2014 9:11 AM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > 
> > This is a simpler test setup:
> > 
> > On 8 core machines this works:
> > 
> > $ mpirun  -np 8  mpi_test_suite -t "environment"
> > [...]
> > Number of failed tests:0
> > 
> > Using 9 or more cores it fails:
> > 
> > $ mpirun  -np 9  mpi_test_suite -t "environment"
> > 
> > mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4
> > SP=7fff06431a70.  Backtrace:
> > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a
> > 4]
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb1
> > 72]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
> > mpi_test_suite[0x46cd00]
> > mpi_test_suite[0x44434c]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
> > mpi_test_suite[0x4058e9]
> > ---
> > Primary job  terminated normally, but 1 process returned a non-zero exit
> > code.. Per user-direction, the job has been aborted.
> > ---
> > 
> > mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4
> > SP=75020430.  Backtrace:
> > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a
> > 4]
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe17
> > 2]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
> > mpi_test_suite[0x46cd00]
> > mpi_test_suite[0x44434c]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
> > mpi_test_suite[0x4058e9]
>

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-28 Thread Adrian Reber
Good to know. I will update the infinipath libraries on the next
occasion and report back. This will probably take a few days (or weeks).

Adrian

On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
> Hi Adrian,
> 
> I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this 
> with one 8-core node):
> 
> $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t "environment"
> (Rank:0) tst_test_array[0]:Status
> (Rank:0) tst_test_array[1]:Request_Null
> (Rank:0) tst_test_array[2]:Type_dup
> (Rank:0) tst_test_array[3]:Get_version
> Number of failed tests:0
> 
> Works with various np from 8 to 32.  Your original case:
> 
> $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> 
> Runs for a while and eventually hits send cancellation errors.
> 
> Any chance you could try updating your infinipath libraries?
> 
> Andrew
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, October 27, 2014 9:11 AM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > 
> > This is a simpler test setup:
> > 
> > On 8 core machines this works:
> > 
> > $ mpirun  -np 8  mpi_test_suite -t "environment"
> > [...]
> > Number of failed tests:0
> > 
> > Using 9 or more cores it fails:
> > 
> > $ mpirun  -np 9  mpi_test_suite -t "environment"
> > 
> > mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4
> > SP=7fff06431a70.  Backtrace:
> > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a
> > 4]
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb1
> > 72]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
> > mpi_test_suite[0x46cd00]
> > mpi_test_suite[0x44434c]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
> > mpi_test_suite[0x4058e9]
> > ---
> > Primary job  terminated normally, but 1 process returned a non-zero exit
> > code.. Per user-direction, the job has been aborted.
> > ---
> > 
> > mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4
> > SP=75020430.  Backtrace:
> > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a
> > 4]
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe17
> > 2]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
> > mpi_test_suite[0x46cd00]
> > mpi_test_suite[0x44434c]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
> > mpi_test_suite[0x4058e9]
> > --
> > mpirun detected that one or more processes exited with non-zero status,
> > thus causing the job to be terminated. The first process to do so was:
> > 
> >   Process name: [[47415,1],0]
> >   Exit code:1
> > ------
> > 
> > 
> > 
> > On Mon, Oct 27, 2014 at 08:27:17AM -0700, Ralph Castain wrote:
> > > I’m afraid I can’t quite decipher from all this what actually fails. Of 
> > > course,
> > PSM doesn’t support dynamic operations like comm_spawn or
> > connect_accept, so if you are running those tests that just won’t work. Is
> > that the heart of the problem here?
> > >
> > >
> > > > On Oct 27, 2014, at 1:40 AM, Adrian Reber <adr...@lisas.de> wrote:
> > > >
> > > > Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> > > > I am getting the same errors also on trunk from my newly set up MTT.
> > > > Before trying to debug this I just wanted to make sure this is not a
> > > > configuration error. I have following PSM packages installed:
> > > >
> > > > infini

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
This is a simpler test setup:

On 8 core machines this works:

$ mpirun  -np 8  mpi_test_suite -t "environment"
[...]
Number of failed tests:0

Using 9 or more cores it fails:

$ mpirun  -np 9  mpi_test_suite -t "environment"

mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4 
SP=7fff06431a70.  Backtrace:
/usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a4]
/usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb172]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
mpi_test_suite[0x46cd00]
mpi_test_suite[0x44434c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
mpi_test_suite[0x4058e9]
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---

mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4 
SP=75020430.  Backtrace:
/usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a4]
/usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe172]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
mpi_test_suite[0x46cd00]
mpi_test_suite[0x44434c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
mpi_test_suite[0x4058e9]
--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[47415,1],0]
  Exit code:1
--



On Mon, Oct 27, 2014 at 08:27:17AM -0700, Ralph Castain wrote:
> I’m afraid I can’t quite decipher from all this what actually fails. Of 
> course, PSM doesn’t support dynamic operations like comm_spawn or 
> connect_accept, so if you are running those tests that just won’t work. Is 
> that the heart of the problem here?
> 
> 
> > On Oct 27, 2014, at 1:40 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> > I am getting the same errors also on trunk from my newly set up MTT.
> > Before trying to debug this I just wanted to make sure this is not a
> > configuration error. I have following PSM packages installed:
> > 
> > infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
> > infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
> > infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
> > 
> > with 1.6.5 I do not see PSM errors and the test suite fails much later:
> > 
> > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY 
> > (28/29)
> > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB 
> > (29/29)
> > n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
> > P2P tests Many-to-one with Isend and Cancellation (22/48), comm 
> > MPI_COMM_WORLD (1/13), type MPI_CHAR (1/29)
> > n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
> > n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
> > n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
> > n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
> > n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
> > n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
> > n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
> > n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
> > n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
> > n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
> > n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
> > n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
> > n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
> > n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
> > n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
> > n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
> > n050304:3.0.Cannot cancel 

[OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
I am getting the same errors also on trunk from my newly set up MTT.
Before trying to debug this I just wanted to make sure this is not a
configuration error. I have the following PSM packages installed:

infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
infinipath-3.1.1-363.1140_rhel6_qlc.x86_64

with 1.6.5 I do not see PSM errors and the test suite fails much later:

P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm Intracomm 
merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY (28/29)
P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm Intracomm 
merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB (29/29)
n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
P2P tests Many-to-one with Isend and Cancellation (22/48), comm MPI_COMM_WORLD 
(1/13), type MPI_CHAR (1/29)
n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
n050304:3.0.Cannot cancel send requests (req=0x2b9196f5fe30)
n050304:6.0.Cannot cancel send requests (req=0x2b30d6c69ed8)
n050301:9.0.Cannot cancel send requests (req=0x2b93c9e04f80)
n050303:9.0.Cannot cancel send requests (req=0x2ab4d6ce0f80)
n050301:5.0.Cannot cancel send requests (req=0x2b6ad851ef80)
n050303:3.0.Cannot cancel send requests (req=0x2b8ef52a0f80)
n050301:3.0.Cannot cancel send requests (req=0x2b277a4aff80)
n050303:7.0.Cannot cancel send requests (req=0x2ba570fa9f80)
n050301:7.0.Cannot cancel send requests (req=0x2ba707dfbf80)
n050302:2.0.Cannot cancel send requests (req=0x2b90f2e51e30)
n050303:5.0.Cannot cancel send requests (req=0x2b1250ba8f80)
n050302:8.0.Cannot cancel send requests (req=0x2b22e0129ed8)
n050303:8.0.Cannot cancel send requests (req=0x2b6609792f80)
n050302:6.0.Cannot cancel send requests (req=0x2b2b6081af80)
n050302:5.0.Cannot cancel send requests (req=0x2ab24f6f1f80)
--
mpirun has exited due to process rank 14 with PID 4496 on
node n050303 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[adrian@n050304 mpi_test_suite]$

and these are my PSM errors with 1.8.3:

[adrian@n050304 mpi_test_suite]$ mpirun  -np 32  mpi_test_suite -t 
"All,^io,^one-sided"

mpi_test_suite:8904 terminated with signal 11 at PC=2b08466239a4 
SP=703c6e30.  Backtrace:

mpi_test_suite:16905 terminated with signal 11 at PC=2ae4cad209a4 
SP=7fffceefa730.  Backtrace:

mpi_test_suite:3171 terminated with signal 11 at PC=2b57daafe9a4 
SP=7fff5c4b3af0.  Backtrace:

mpi_test_suite:16906 terminated with signal 11 at PC=2b4c9fa019a4 
SP=7fffe916c330.  Backtrace:

mpi_test_suite:3172 terminated with signal 11 at PC=2b6dde92e9a4 
SP=7fff04cf1730.  Backtrace:

mpi_test_suite:16907 terminated with signal 11 at PC=2ad6eb8589a4 
SP=7fffc30d02f0.  Backtrace:

mpi_test_suite:3173 terminated with signal 11 at PC=2b2e4aec89a4 
SP=7fffa054e230.  Backtrace:

mpi_test_suite:16908 terminated with signal 11 at PC=2b4e6e5589a4 
SP=7fff68c7a1f0.  Backtrace:

mpi_test_suite:3174 terminated with signal 11 at PC=2b7049b279a4 
SP=7fff99a49f70.  Backtrace:

mpi_test_suite:16909 terminated with signal 11 at PC=2b252219d9a4 
SP=7fff72a0c6b0.  Backtrace:

mpi_test_suite:3175 terminated with signal 11 at PC=2ac8d5caf9a4 
SP=7fff6d7a63f0.  Backtrace:

mpi_test_suite:16910 terminated with signal 11 at 

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-17 Thread Adrian Reber
Josh,

I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are
two uses of orte code:

if (orte_cr_continue_like_restart)

and

 /* On restart we need the old file names to exist (not necessarily
  * contain content) so the CRS component does not fail when  searching
  * for these old file handles. The restart procedure will make sure
  * these files get cleaned up appropriately.
  */
 orte_sstore.set_attr(orte_sstore_handle_current,
  SSTORE_METADATA_LOCAL_TOUCH,
  mca_btl_sm_component.sm_seg->shmem_ds.seg_name);


Do you have an idea how to fix those two? The first variable
orte_cr_continue_like_restart could probably be moved but I am not sure
how to handle the sstore call.
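
One possible shape for the sstore part, purely as a sketch with made-up names:
give OPAL a function-pointer hook that stays NULL unless ORTE's CR code
registers an sstore-backed implementation, so the BTL never references ORTE
directly:

/* hypothetical OPAL-level hook (names invented for illustration) */
typedef void (*opal_cr_touch_file_fn_t)(const char *filename);
extern opal_cr_touch_file_fn_t opal_cr_touch_file_hook;

/* in opal/mca/btl/sm/btl_sm.c, instead of calling orte_sstore directly: */
if (NULL != opal_cr_touch_file_hook) {
    opal_cr_touch_file_hook(mca_btl_sm_component.sm_seg->shmem_ds.seg_name);
}

/* ORTE's CR init would then point the hook at a small wrapper around
 * orte_sstore.set_attr(orte_sstore_handle_current,
 *                      SSTORE_METADATA_LOCAL_TOUCH, ...) */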

Adrian


On Sat, Aug 09, 2014 at 08:46:31AM -0500, Josh Hursey wrote:
> Those calls should be protected with the CR FT #define - If I remember
> correctly. We were using the sstore to track the shared memory file names
> so we could clean them up on restart.
> 
> I'm not sure if the sstore framework is necessary in this location, since
> we should be able to tell opal_crs and it will do the right thing. I can
> try to look at it early next week if someone doesn't get to it before then.
> 
> -- Josh
> 
> 
> 
> On Sat, Aug 9, 2014 at 7:06 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> 
> > I think you're making a joke, right...?
> >
> > I see direct calls to ORTE sstore functionality in all three.
> >
> >
> >
> >
> > On Aug 8, 2014, at 5:42 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > > These are harmless. They are only used when FT is enabled which should
> > rarely be the case.
> > >
> > >   George.
> > >
> > >
> > >
> > > On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) <
> > jsquy...@cisco.com> wrote:
> > > Here's a few ORTE headers in OPAL source -- can respective owners clean
> > these up?  Thanks.
> > >
> > > -
> > > mca/btl/smcuda/btl_smcuda.c
> > > 63:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/btl/sm/btl_sm.c
> > > 62:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/mpool/sm/mpool_sm_module.c
> > > 34:#include "orte/mca/sstore/sstore.h"
> > > -
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15570.php
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15571.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15588.php


Adrian

-- 
Adrian Reber <adr...@lisas.de>http://lisas.de/~adrian/
ink, n.:
A villainous compound of tannogallate of iron, gum-arabic,
and water, chiefly used to facilitate the infection of
idiocy and promote intellectual crime.
-- H.L. Mencken


Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
I have seen it. I am still waiting for things to settle down before I
start fixing the FT code ( again ;-)

Adrian

On Mon, Aug 11, 2014 at 01:40:33PM +, Jeff Squyres (jsquyres) wrote:
> Ah, I see.
> 
> Ok -- add it to the list of 
> FT-things-to-be-fixed-before-FT-can-be-supported-again (which I think Josh 
> just did :-) ).
> 
> Also: Adrian -- FYI.  :-)
> 
> 
> On Aug 11, 2014, at 9:05 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> > I just checked the code and noticed that all the usages of the sstore are 
> > protected by an OPAL_ENABLE_FT_CR define. As we are not supporting FT, I 
> > don't think this is something we should spend time fixing right now.
> > 
> >   George.
> > 
> > 
> > 
> > On Sat, Aug 9, 2014 at 8:06 AM, Jeff Squyres (jsquyres) 
> > <jsquy...@cisco.com> wrote:
> > I think you're making a joke, right...?
> > 
> > I see direct calls to ORTE sstore functionality in all three.
> > 
> > 
> > 
> > 
> > On Aug 8, 2014, at 5:42 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > 
> > > These are harmless. They are only used when FT is enabled which should 
> > > rarely be the case.
> > >
> > >   George.
> > >
> > >
> > >
> > > On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) 
> > > <jsquy...@cisco.com> wrote:
> > > Here's a few ORTE headers in OPAL source -- can respective owners clean 
> > > these up?  Thanks.
> > >
> > > -
> > > mca/btl/smcuda/btl_smcuda.c
> > > 63:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/btl/sm/btl_sm.c
> > > 62:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/mpool/sm/mpool_sm_module.c
> > > 34:#include "orte/mca/sstore/sstore.h"
> > > -
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to: 
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post: 
> > > http://www.open-mpi.org/community/lists/devel/2014/08/15570.php
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post: 
> > > http://www.open-mpi.org/community/lists/devel/2014/08/15571.php
> > 
> > 
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/08/15607.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/

Adrian

-- 
Adrian Reber <adr...@lisas.de>http://lisas.de/~adrian/
Authentic:
Indubitably true, in somebody's opinion.


Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
And with the following change I can get it to compile again:

diff --git a/opal/mca/mpool/base/mpool_base_frame.c 
b/opal/mca/mpool/base/mpool_base_frame.c
index c1b044b..f94b8a5 100644
--- a/opal/mca/mpool/base/mpool_base_frame.c
+++ b/opal/mca/mpool/base/mpool_base_frame.c
@@ -21,12 +21,10 @@
 #include "opal_config.h"

 #include <stdio.h>
+#include <malloc.h>
 #ifdef HAVE_UNISTD_H
 #include <unistd.h>
 #endif  /* HAVE_UNISTD_H */
-#ifdef HAVE_MALLOC_H
-#include <malloc.h>
-#endif

 #include "opal/mca/mca.h"
 #include "opal/mca/base/base.h"
diff --git a/opal/util/malloc.h b/opal/util/malloc.h
index db5a4d0..efeaf98 100644
--- a/opal/util/malloc.h
+++ b/opal/util/malloc.h
@@ -21,7 +21,7 @@
 #ifndef OPAL_MALLOC_H
 #define OPAL_MALLOC_H

-#include "opal_config.h"
+#include 
 #include 

 /*
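
For reference, the failure boils down to a function-like macro colliding with a
later prototype. A stripped-down reproduction (which intentionally does not
compile) of what the preprocessor does with the macro from opal_config_bottom.h:

#include <stddef.h>

void *opal_malloc(size_t size, const char *file, int line);

/* the wrapper macro quoted in the errors below */
#define malloc(size) opal_malloc((size), __FILE__, __LINE__)

/* a later system prototype such as the one in /usr/include/malloc.h ... */
extern void *malloc(size_t __size);
/* ... is expanded by the preprocessor into roughly
 *     extern void *opal_malloc((size_t __size), "file.c", 10);
 * which is not a valid declaration -- hence the
 * "expected declaration specifiers or '...' before '(' token" errors. */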


On Mon, Aug 04, 2014 at 06:39:13PM +0200, Adrian Reber wrote:
> I can confirm this on Fedora 20 with gcc 4.8.3.
> 
> Running ./configure without any options gives me the same error.
> 
> On Mon, Aug 04, 2014 at 04:24:29PM +, Pritchard Jr., Howard wrote:
> > Hi Ralph,
> > 
> > Nope that doesn't fix the problem I'm hitting.   I tried to build the opmi 
> > trunk
> > on a system with a much older gcc compiler (4.4.7) and it compiled :)!  But
> > I'd like to be able to compile opmi with a newer gcc like the one on my 
> > opensuse
> > 13.1 box.
> > 
> > The preprocessor is pulling in the system malloc.h and that's where things 
> > blow up:
> > 
> >   CC   base/mpool_base_frame.lo
> > In file included from ../../../opal/include/opal_config.h:2750:0,
> >  from base/mpool_base_frame.c:21:
> > ../../../opal/include/opal_config_bottom.h:381:38: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define malloc(size) opal_malloc((size), __FILE__, __LINE__)
> >   ^
> > In file included from base/mpool_base_frame.c:28:0:
> > /usr/include/malloc.h:38:1: error: expected declaration specifiers or '...' 
> > before string constant
> > extern void *malloc (size_t __size) __THROW __attribute_malloc__ __wur;
> > ^
> > /usr/include/malloc.h:38:1: error: expected declaration specifiers or '...' 
> > before numeric constant
> > In file included from ../../../opal/include/opal_config.h:2750:0,
> >  from base/mpool_base_frame.c:21:
> > ../../../opal/include/opal_config_bottom.h:385:48: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define calloc(nmembers, size) opal_calloc((nmembers), (size), 
> > __FILE__, __LINE__)
> > ^
> > ../../../opal/include/opal_config_bottom.h:385:60: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define calloc(nmembers, size) opal_calloc((nmembers), (size), 
> > __FILE__, __LINE__)
> > ^
> > In file included from base/mpool_base_frame.c:28:0:
> > /usr/include/malloc.h:41:1: error: expected declaration specifiers or '...' 
> > before string constant
> > extern void *calloc (size_t __nmemb, size_t __size)
> > ^
> > /usr/include/malloc.h:41:1: error: expected declaration specifiers or '...' 
> > before numeric constant
> > In file included from ../../../opal/include/opal_config.h:2750:0,
> >  from base/mpool_base_frame.c:21:
> > ../../../opal/include/opal_config_bottom.h:389:45: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define realloc(ptr, size) opal_realloc((ptr), (size), __FILE__, 
> > __LINE__)
> >  ^
> > ../../../opal/include/opal_config_bottom.h:389:52: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define realloc(ptr, size) opal_realloc((ptr), (size), __FILE__, 
> > __LINE__)
> > ^
> > In file included from base/mpool_base_frame.c:28:0:
> > /usr/include/malloc.h:49:1: error: expected declaration specifiers or '...' 
> > before string constant
> > extern void *realloc (void *__ptr, size_t __size)
> > ^
> > /usr/include/malloc.h:49:1: error: expected declaration specifiers or '...' 
> > before numeric constant
> > In file included from ../../../opal/include/opal_config.h:2750:0,
> >  from base/mpool_base_frame.c:21:
> > ../../../opal/include/opal_config_bottom.h:393:33: error: expected 
> > declaration specifiers or '...' before '(' token
> > #define free(ptr) opal_free((ptr), __FILE__, __LINE__)
> >

Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
I can confirm this on Fedora 20 with gcc 4.8.3.

Running ./configure without any options gives me the same error.

On Mon, Aug 04, 2014 at 04:24:29PM +, Pritchard Jr., Howard wrote:
> Hi Ralph,
> 
> Nope that doesn't fix the problem I'm hitting.   I tried to build the opmi 
> trunk
> on a system with a much older gcc compiler (4.4.7) and it compiled :)!  But
> I'd like to be able to compile opmi with a newer gcc like the one on my 
> opensuse
> 13.1 box.
> 
> The preprocessor is pulling in the system malloc.h and that's where things 
> blow up:
> 
>   CC   base/mpool_base_frame.lo
> In file included from ../../../opal/include/opal_config.h:2750:0,
>  from base/mpool_base_frame.c:21:
> ../../../opal/include/opal_config_bottom.h:381:38: error: expected 
> declaration specifiers or '...' before '(' token
> #define malloc(size) opal_malloc((size), __FILE__, __LINE__)
>   ^
> In file included from base/mpool_base_frame.c:28:0:
> /usr/include/malloc.h:38:1: error: expected declaration specifiers or '...' 
> before string constant
> extern void *malloc (size_t __size) __THROW __attribute_malloc__ __wur;
> ^
> /usr/include/malloc.h:38:1: error: expected declaration specifiers or '...' 
> before numeric constant
> In file included from ../../../opal/include/opal_config.h:2750:0,
>  from base/mpool_base_frame.c:21:
> ../../../opal/include/opal_config_bottom.h:385:48: error: expected 
> declaration specifiers or '...' before '(' token
> #define calloc(nmembers, size) opal_calloc((nmembers), (size), __FILE__, 
> __LINE__)
> ^
> ../../../opal/include/opal_config_bottom.h:385:60: error: expected 
> declaration specifiers or '...' before '(' token
> #define calloc(nmembers, size) opal_calloc((nmembers), (size), __FILE__, 
> __LINE__)
> ^
> In file included from base/mpool_base_frame.c:28:0:
> /usr/include/malloc.h:41:1: error: expected declaration specifiers or '...' 
> before string constant
> extern void *calloc (size_t __nmemb, size_t __size)
> ^
> /usr/include/malloc.h:41:1: error: expected declaration specifiers or '...' 
> before numeric constant
> In file included from ../../../opal/include/opal_config.h:2750:0,
>  from base/mpool_base_frame.c:21:
> ../../../opal/include/opal_config_bottom.h:389:45: error: expected 
> declaration specifiers or '...' before '(' token
> #define realloc(ptr, size) opal_realloc((ptr), (size), __FILE__, __LINE__)
>  ^
> ../../../opal/include/opal_config_bottom.h:389:52: error: expected 
> declaration specifiers or '...' before '(' token
> #define realloc(ptr, size) opal_realloc((ptr), (size), __FILE__, __LINE__)
> ^
> In file included from base/mpool_base_frame.c:28:0:
> /usr/include/malloc.h:49:1: error: expected declaration specifiers or '...' 
> before string constant
> extern void *realloc (void *__ptr, size_t __size)
> ^
> /usr/include/malloc.h:49:1: error: expected declaration specifiers or '...' 
> before numeric constant
> In file included from ../../../opal/include/opal_config.h:2750:0,
>  from base/mpool_base_frame.c:21:
> ../../../opal/include/opal_config_bottom.h:393:33: error: expected 
> declaration specifiers or '...' before '(' token
> #define free(ptr) opal_free((ptr), __FILE__, __LINE__)
>  ^
> In file included from base/mpool_base_frame.c:28:0:
> /usr/include/malloc.h:53:1: error: expected declaration specifiers or '...' 
> before string constant
> extern void free (void *__ptr) __THROW;
> ^
> /usr/include/malloc.h:53:1: error: expected declaration specifiers or '...' 
> before numeric constant
> 
> 
> 
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Monday, August 04, 2014 10:09 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] opal_config_bottom.h question again
> 
> I believe the issue is actually in opal/util/malloc.h, Howard. I noticed this 
> while looking around this weekend - someone included opal_config.h in the 
> malloc.h file even though it explicitly says "DON'T DO THIS"  in that header 
> file.
> 
> #ifndef OPAL_MALLOC_H
> #define OPAL_MALLOC_H
> 
> #include "opal_config.h"
> #include 
> 
> /*
>  * THIS FILE CANNOT INCLUDE ANY OTHER OPAL HEADER FILES!!!
>  *
>  * It is included via .  Hence, it should not
>  * include ANY other files, nor should it include "opal_config.h".
>  *
>  */
> 
> Don't know why someone did that, but you might see if it fixes your problem
> 
> 
> On Aug 4, 2014, at 9:00 AM, Pritchard Jr., Howard 
> > wrote:
> 
> 
> Hi Folks,
> 
> As I said last week, I'm noticing now that on my opensuse 13.1 system and gcc 
> 4.8.1, when I do a fresh
> checkout of trunk ompi and try to build, 

Re: [OMPI devel] r31916 question

2014-06-19 Thread Adrian Reber
The fault tolerance code also needs additional changes because of this
commit. I have the changes prepared but not committed.

On Wed, Jun 18, 2014 at 03:45:11PM -0700, Ralph Castain wrote:
> Huh - thought I got that. Sorry I missed it. Let me take a look and ensure 
> that the alps ras module is setting that attribute
> 
> On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P  wrote:
> 
> > Hello Folks,
> >  
> > I’m looking at commit 31916 and notice a lot of fields were removed from 
> > orte_node_t.
> > This is now preventing ras_alps_module.c from compiling owing to use of a 
> > “launch_id”
> > field.
> >  
> > In lieu of the direct use of launch_id, should I replace the code around 
> > 587 of this file with
> > use of orte_get_attribute with ORTE_NODE_LAUNCH_ID for the attribute to be 
> > retrieved?
> >  
> > Thanks,
> >  
> > Howard
> >  
> >  
> > -
> > Howard Pritchard
> > HPC-5
> > Los Alamos National Laboratory
> >  
> >  
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/06/15008.php
> 

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/15009.php



Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-25 Thread Adrian Reber
On Fri, Apr 25, 2014 at 10:29:36AM +, Jeff Squyres (jsquyres) wrote:
> On Apr 25, 2014, at 6:13 AM, Gilles Gouaillardet 
>  wrote:
> 
> > it is possible to use qemu in order to emulate unavailable hardware.
> > for what it's worth, i am now running a ppc64 qemu emulated virtual
> > machine on an x86_64 workstation.
> > this is pretty slow (2 hours for configure and even more for make) but
> > enough to make simple tests/debugging.
> 
> 
> Fair point.  I have a (very) dim recollection of someone raising the same 
> point the last time we talked about heterogeneity.
> 
> If someone would volunteer to do this, and run MTT in this setup on a 
> regular, preferably automated, schedule (something even as low as once a week 
> would probably be fine), that would probably change the equation.

I have access to ppc64 machines (PPC970MP and PowerXCell) on which I
could set up MTT.

Adrian


Re: [OMPI devel] 1-question developer poll

2014-04-16 Thread Adrian Reber
On Wed, Apr 16, 2014 at 10:32:10AM +, Jeff Squyres (jsquyres) wrote:
> What source code repository technology(ies) do you use for Open MPI 
> development? (indicate all that apply)
> 
> - SVN
> - Mercurial
> - Git

git

Adrian




[OMPI devel] Open MPI and CRIU stdout/stderr

2014-03-19 Thread Adrian Reber
Cross-posting to criu and openmpi devel mailinglists.

To get fault tolerance back into Open MPI I added code to use criu as
a checkpoint/restart tool. I can checkpoint a process successfully
but I have trouble restarting it. CRIU currently has problems restoring
the process, which is probably related to stdout/stderr handling.

(00.026198)  15852: Error (tty.c:541): tty: Can't dup SELF_STDIN_OFF: Bad file 
descriptor

What does Open MPI do with the file descriptors for stdout/stderr?

Would it make sense to close stdout/stderr of each checkpointed process
before checkpointing it?

Is there something concerning stdout/stderr which I forgot to handle?

Adrian


Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-18 Thread Adrian Reber
Thanks for your fix.

You say that the environment is only taken into
account during register. There is another variable set in the
environment in opal-restart.c. Does the following still work:

opal-restart.c:

(void) mca_base_var_env_name("crs", _env_var);
opal_setenv(tmp_env_var,
expected_crs_comp,
true, );
free(tmp_env_var);
tmp_env_var = NULL;

The preferred checkpointer is selected like this and in
opal_crs_base_select() the following happens:

if( OPAL_SUCCESS != mca_base_select("crs", opal_crs_base_framework.framework_output,
                                    &opal_crs_base_framework.framework_components,
                                    (mca_base_module_t **) &best_module,
                                    (mca_base_component_t **) &best_component) ) {
    /* This will only happen if no component was selected */
    exit_status = OPAL_ERROR;
    goto cleanup;
}

Does the mca_base_var_env_name() influence which crs module
is selected during mca_base_select()? Or do I have to change it
also to mca_base_var_set_value() to select the preferred crs module?

Adrian


On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote:
> Good catch. Fixing now.
> 
> -Nathan
> 
> On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote:
> > On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
> > > The preferred way is to use mca_base_var_find and then call 
> > > mca_base_var_[set|get]_value. For performance sake we only look at the 
> > > environment when the variable is registered.
> > 
> > I believe I found a bug in mca_base_var_set_value using bool variables:
> > 
> > #0  0x7f6e0d8fb800 in mca_base_var_enum_bool_sfv (self=0x7f6e0dbabc20 
> > , value=0, 
> > string_value=0x0) at ../../../../opal/mca/base/mca_base_var_enum.c:82
> > #1  0x7f6e0d8f45d6 in mca_base_var_set_value (vari=120, value=0x4031e6, 
> > size=0, source=MCA_BASE_VAR_SOURCE_DEFAULT, 
> > source_file=0x0) at ../../../../opal/mca/base/mca_base_var.c:636
> > #2  0x00401e44 in main (argc=7, argv=0x7fffa72a0a78) at 
> > ../../../../opal/tools/opal-restart/opal-restart.c:223
> > 
> > I am using set_value like this:
> > 
> > bool test=false;
> > mca_base_var_set_value(idx, &test, 0, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);
> > 
> > As the size is ignored I am just setting it to '0'.
> > 
> > mca_base_var_set_value() does 
> > 
> > ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,((int *) 
> > value)[0], NULL);
> > 
> > which calls mca_base_var_enum_bool_sfv() with the last parameter set to 
> > NULL:
> > 
> > static int mca_base_var_enum_bool_sfv (mca_base_var_enum_t *self, const int 
> > value,
> >const char **string_value)
> > {
> > *string_value = value ? "true" : "false";
> > 
> > return OPAL_SUCCESS;
> > }
> > 
> > and here it tries to access the last parameter (string_value) which has
> > been set to NULL. As I cannot find any usage of mca_base_var_set_value()
> > with bool variables this code path has probably not been used until now.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/03/14354.php



> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14355.php


Adrian

-- 
Adrian Reber <adr...@lisas.de>http://lisas.de/~adrian/
printk(KERN_ERR "msp3400: chip reset failed, penguin on i2c bus?\n");
2.2.16 /usr/src/linux/drivers/char/msp3400.c




Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-17 Thread Adrian Reber
On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
> The preferred way is to use mca_base_var_find and then call 
> mca_base_var_[set|get]_value. For performance sake we only look at the 
> environment when the variable is registered.

I believe I found a bug in mca_base_var_set_value using bool variables:

#0  0x7f6e0d8fb800 in mca_base_var_enum_bool_sfv (self=0x7f6e0dbabc20 
, value=0, 
string_value=0x0) at ../../../../opal/mca/base/mca_base_var_enum.c:82
#1  0x7f6e0d8f45d6 in mca_base_var_set_value (vari=120, value=0x4031e6, 
size=0, source=MCA_BASE_VAR_SOURCE_DEFAULT, 
source_file=0x0) at ../../../../opal/mca/base/mca_base_var.c:636
#2  0x00401e44 in main (argc=7, argv=0x7fffa72a0a78) at 
../../../../opal/tools/opal-restart/opal-restart.c:223

I am using set_value like this:

bool test=false;
mca_base_var_set_value(idx, &test, 0, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);

As the size is ignored I am just setting it to '0'.

mca_base_var_set_value() does 

ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,((int *) 
value)[0], NULL);

which calls mca_base_var_enum_bool_sfv() with the last parameter set to NULL:

static int mca_base_var_enum_bool_sfv (mca_base_var_enum_t *self, const int 
value,
   const char **string_value)
{
*string_value = value ? "true" : "false";

return OPAL_SUCCESS;
}

and here it tries to access the last parameter (string_value) which has
been set to NULL. As I cannot find any usage of mca_base_var_set_value()
with bool variables this code path has probably not been used until now.
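
One possible guard (a sketch only, not necessarily the fix that ended up
being committed) would be to make the enumerator tolerate a NULL
string_value, as a drop-in replacement for the function quoted above:

    static int mca_base_var_enum_bool_sfv (mca_base_var_enum_t *self, const int value,
                                           const char **string_value)
    {
        /* callers such as mca_base_var_set_value() may pass NULL when they
         * only care about the conversion, not about the resulting string */
        if (NULL != string_value) {
            *string_value = value ? "true" : "false";
        }

        return OPAL_SUCCESS;
    }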

Adrian


Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-15 Thread Adrian Reber
Thanks, that was the information I needed.

On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
> The preferred way is to use mca_base_var_find and then call 
> mca_base_var_[set|get]_value. For performance sake we only look at the 
> environment when the variable is registered.
> 
> -Nathan
> 
> Please excuse the horrible Outlook top-posting. OWA sucks.
> 
> 
> From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber 
> [adr...@lisas.de]
> Sent: Friday, March 14, 2014 3:05 PM
> To: de...@open-mpi.org
> Subject: [OMPI devel] usage of mca variables in orte-restart
> 
> I am now trying to run orte-restart. As far as I understand it
> orte-restart analyzes the checkpoint metadata and then tries to exec()
> mpirun which then starts opal-restart. During the startup of
> opal-restart (during initialize()) detection of the best CRS module is
> disabled:
> 
> /*
>  * Turn off the selection of the CRS component,
>  * we need to do that later
>  */
> (void) mca_base_var_env_name("crs_base_do_not_select", _env_var);
> opal_setenv(tmp_env_var,
> "1", /* turn off the selection */
> true, );
> free(tmp_env_var);
> tmp_env_var = NULL;
> 
> This seems to work. Later when actually selecting the correct CRS module
> to restart the checkpointed process the selection is enabled again:
> 
> /* Re-enable the selection of the CRS component, so we can choose the 
> right one */
> (void) mca_base_var_env_name("crs_base_do_not_select", _env_var);
> opal_setenv(tmp_env_var,
> "0", /* turn on the selection */
> true, );
> free(tmp_env_var);
> tmp_env_var = NULL;
> 
> This does not seem to have an effect. The one reason why it does not work
> is pretty obvious. The mca variable crs_base_do_not_select is registered 
> during
> opal_crs_base_register() and written to the bool variable 
> opal_crs_base_do_not_select
> only once (during register). Later in opal_crs_base_select() this bool
> variable is queried if select should run or not and as it is only changed
> during register it never changes. So from the code flow it cannot work
> and is probably the result of one of the rewrites since C/R was introduced.
> 
> To fix this I am trying to read the value of the MCA variable
> opal_crs_base_do_not_select during opal_crs_base_select() like this:
> 
>  idx = mca_base_var_find("opal", "crs", "base", "do_not_select")
>  mca_base_var_get_value(idx, , NULL, NULL);
> 
> This also seems to work: the value I read changes if I modify the first
> opal_setenv() during initialize(). The problem I am seeing is that the
> second opal_setenv() (back to 0) is not picked up by
> mca_base_var_get_value().
> 
> So my question is: what is the preferred way to read and write MCA
> variables so that they can be accessed from the different modules? Is the
> existing code still correct? There is also mca_base_var_set_value(); should
> I rather use that to set 'opal_crs_base_do_not_select'? I was, however, not
> able to use mca_base_var_set_value() without a segfault. There are not many
> uses of mca_base_var_set_value() in the existing code and none of them uses
> a bool variable.
> 
> I also discovered I can just access the global C variable 
> 'opal_crs_base_do_not_select'
> from opal-restart.c as well as from opal_crs_base_select(). This also works.
> This would solve my problem setting and reading MCA variables.
> 
> Adrian


[OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Adrian Reber
I am now trying to run orte-restart. As far as I understand it
orte-restart analyzes the checkpoint metadata and then tries to exec()
mpirun which then starts opal-restart. During the startup of
opal-restart (during initialize()) detection of the best CRS module is
disabled:

/* 
 * Turn off the selection of the CRS component,
 * we need to do that later
 */
(void) mca_base_var_env_name("crs_base_do_not_select", _env_var);
opal_setenv(tmp_env_var,
"1", /* turn off the selection */
true, );
free(tmp_env_var);
tmp_env_var = NULL;

This seems to work. Later when actually selecting the correct CRS module
to restart the checkpointed process the selection is enabled again:

/* Re-enable the selection of the CRS component, so we can choose the right 
one */
(void) mca_base_var_env_name("crs_base_do_not_select", _env_var);
opal_setenv(tmp_env_var,
"0", /* turn on the selection */
true, );
free(tmp_env_var);
tmp_env_var = NULL;

This does not seem to have an effect. The one reason why it does not work
is pretty obvious. The mca variable crs_base_do_not_select is registered during
opal_crs_base_register() and written to the bool variable 
opal_crs_base_do_not_select
only once (during register). Later in opal_crs_base_select() this bool
variable is queried if select should run or not and as it is only changed
during register it never changes. So from the code flow it cannot work
and is probably the result of one of the rewrites since C/R was introduced.
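
The effect is the same as in this standalone sketch (not OMPI code): a flag
that is read from the environment only once, at register time, never sees a
later setenv():

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    static bool do_not_select;           /* stands in for opal_crs_base_do_not_select */

    static void register_params(void)    /* the environment is read exactly once here */
    {
        const char *v = getenv("CRS_BASE_DO_NOT_SELECT");
        do_not_select = (NULL != v && '1' == v[0]);
    }

    int main(void)
    {
        setenv("CRS_BASE_DO_NOT_SELECT", "1", 1);
        register_params();                              /* value captured: true */

        setenv("CRS_BASE_DO_NOT_SELECT", "0", 1);       /* this later change ...   */
        printf("do_not_select = %d\n", do_not_select);  /* ... is not seen: still 1 */

        return 0;
    }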

To fix this I am trying to read the value of the MCA variable
opal_crs_base_do_not_select during opal_crs_base_select() like this:

 idx = mca_base_var_find("opal", "crs", "base", "do_not_select")
 mca_base_var_get_value(idx, , NULL, NULL);

This also seems to work: the value I read changes if I modify the first
opal_setenv() during initialize(). The problem I am seeing is that the
second opal_setenv() (back to 0) is not picked up by
mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA
variables so that they can be accessed from the different modules? Is the
existing code still correct? There is also mca_base_var_set_value(); should
I rather use that to set 'opal_crs_base_do_not_select'? I was, however, not
able to use mca_base_var_set_value() without a segfault. There are not many
uses of mca_base_var_set_value() in the existing code and none of them uses
a bool variable.

I also discovered I can just access the global C variable 
'opal_crs_base_do_not_select'
from opal-restart.c as well as from opal_crs_base_select(). This also works.
This would solve my problem setting and reading MCA variables.

Adrian


[OMPI devel] orte-restart and PATH

2014-03-12 Thread Adrian Reber
I am using orte-restart without adding my Open MPI installation to my
PATH. I am running /full/path/to/orte-restart and orte-restart
tries to run mpirun to restart the process. This fails on my system
because I do not have any mpirun in my PATH. Is it expected for an Open
MPI installation to set up the PATH variable or should it work using the
absolute path to the binaries?

Should I just set my PATH correctly and be done with it or should
orte-restart figure out the full path to its accompanying mpirun and
start mpirun with the full path?
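
For the second option, one possible approach (a sketch only, assuming the
installdirs framework is already initialized at that point) would be to build
the path from opal_install_dirs instead of relying on PATH:

    char *mpirun_cmd = NULL;

    /* opal_install_dirs.bindir points at the bindir of this installation */
    if (0 > asprintf(&mpirun_cmd, "%s/mpirun", opal_install_dirs.bindir)) {
        return ORTE_ERR_OUT_OF_RESOURCE;    /* illustrative error handling */
    }
    /* ... hand mpirun_cmd to the exec/argv code instead of the bare "mpirun" ... */
    free(mpirun_cmd);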

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Adrian Reber
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
> > If you like, I can define the required code in the trunk and let you 
> > fill in the event functionality.
>  
>  That would be great.
> >>> 
> >>> Thanks for your changes. When using --with-ft there are a few compiler
> >>> errors which I tried to fix with following patch:
> >>> 
> >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
> >> 
> >> That looks okay, with the only caveat being that you wouldn't ordinarily 
> >> pass the state_caddy_t into a function. It's just there to pass along the 
> >> job etc in case the callback function needs to reference something. In 
> >> this case, I can't think of anything the FT event function would need to 
> >> know - you just want it to quiet all messaging.
> > 
> > I need to pass the type of state to the ft_event() functions:
> > 
> > enum opal_crs_state_type_t {
> >OPAL_CRS_NONE= 0,
> >OPAL_CRS_CHECKPOINT  = 1,
> >OPAL_CRS_RESTART_PRE = 2,
> >OPAL_CRS_RESTART = 3, /* RESTART_POST */
> > 
> > so an int is all I need. So I probably need to encode it into *cbdata. Do I
> > just use an int directly in *cbdata or should it be part of a struct?
> 
> Why don't you define a job state for each of those, and then you can walk the 
> state machine thru them if needed? That way the state caddy will already 
> provide you with the state and you can just pass it to the functions.

Like this?

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=79d6c8262bf809bb2f9ecc853d4a7a42a88654da

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-03-07 Thread Adrian Reber
On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> > Sorry for delay - yes, that looks like the right direction. I would 
> > suggest doing it via the current state machine, though, by simply 
> > defining another job or proc state in orte/mca/plm/plm_types.h, and 
> > then registering a callback function using the 
> > orte_state.add_job[proc]_state(state, function to be called, 
> > ORTE_ERR_PRI). Then you can activate it by calling 
> > ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in 
> > the proper order.
>  
>  What is a job/proc in the Open MPI context.
> >>> 
> >>> A "job" is the entire application, while a "proc" is just one process in 
> >>> that application. In this case you could use either one as you are 
> >>> checkpointing the entire job, but all this activity is occurring inside 
> >>> each proc. So I'd suggest defining it as a proc state since it only 
> >>> really involves local actions.
> >>> 
> >>> If you like, I can define the required code in the trunk and let you fill 
> >>> in the event functionality.
> >> 
> >> That would be great.
> > 
> > Thanks for your changes. When using --with-ft there are a few compiler
> > errors which I tried to fix with following patch:
> > 
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
> 
> That looks okay, with the only caveat being that you wouldn't ordinarily pass 
> the state_caddy_t into a function. It's just there to pass along the job etc 
> in case the callback function needs to reference something. In this case, I 
> can't think of anything the FT event function would need to know - you just 
> want it to quiet all messaging.

I need to pass the type of state to the ft_event() functions:

enum opal_crs_state_type_t {
OPAL_CRS_NONE= 0,
OPAL_CRS_CHECKPOINT  = 1,
OPAL_CRS_RESTART_PRE = 2,
OPAL_CRS_RESTART = 3, /* RESTART_POST */

so an int is all I need. So I probably need to encode it into *cbdata. Do I
just use an int directly in *cbdata or should it be part of a struct?
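
One way to avoid squeezing the phase into *cbdata would be to define one
proc state per phase and let the state machine deliver it, roughly like this
(the ORTE_PROC_STATE_FT_* names, the helper functions and the caddy field
access are illustrative only, not existing symbols):

    /* callback registered for the (hypothetical) FT proc states */
    static void ft_event_cb(int fd, short args, void *cbdata)
    {
        orte_state_caddy_t *caddy = (orte_state_caddy_t *) cbdata;

        /* assumption: the caddy carries the proc state that was activated,
         * so the checkpoint/restart phase does not have to travel in *cbdata */
        if (ORTE_PROC_STATE_FT_CHECKPOINT == caddy->proc_state) {
            oob_quiet_for_checkpoint();             /* illustrative helper */
        } else if (ORTE_PROC_STATE_FT_RESTART == caddy->proc_state) {
            oob_resume_after_restart();             /* illustrative helper */
        }

        OBJ_RELEASE(caddy);
    }

    /* during init:   orte_state.add_proc_state(ORTE_PROC_STATE_FT_CHECKPOINT, ft_event_cb, ORTE_ERR_PRI);
     * to trigger it: ORTE_ACTIVATE_PROC_STATE(NULL, ORTE_PROC_STATE_FT_CHECKPOINT);       */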

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Adrian Reber
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
> > >>> I tried to implement something like you described. It is not yet event
> > >>> driven, but before continuing I wanted to get some feedback if it is at
> > >>> least the right start:
> > >>> 
> > >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> > >>> 
> > >>> I looked at the other ORTE_OOB_* macros and tried to model my
> > >>> functionality a bit after what I have seen there. Right now it is still
> > >>> a simple function which just tries to call ft_event() on all oob
> > >>> components. Does this look right so far?
> > >> 
> > >> Sorry for delay - yes, that looks like the right direction. I would 
> > >> suggest doing it via the current state machine, though, by simply 
> > >> defining another job or proc state in orte/mca/plm/plm_types.h, and then 
> > >> registering a callback function using the 
> > >> orte_state.add_job[proc]_state(state, function to be called, 
> > >> ORTE_ERR_PRI). Then you can activate it by calling 
> > >> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> > >> proper order.
> > > 
> > > What is a job/proc in the Open MPI context.
> > 
> > A "job" is the entire application, while a "proc" is just one process in 
> > that application. In this case you could use either one as you are 
> > checkpointing the entire job, but all this activity is occurring inside 
> > each proc. So I'd suggest defining it as a proc state since it only really 
> > involves local actions.
> > 
> > If you like, I can define the required code in the trunk and let you fill 
> > in the event functionality.
> 
> That would be great.

Thanks for your changes. When using --with-ft there are a few compiler
errors which I tried to fix with following patch:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c

Adrian


Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-05 Thread Adrian Reber
Josh, please have a look at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5d5edafa36605ca7650eafa7f99fa1985641e488

I moved the parameter initialization to sstore_stage_register() and can
now see that the variables are correctly used:

$ orterun --mca sstore_stage_verbose 30
[...]
[dcbz:02880] sstore:stage: open()
[dcbz:02880] sstore:stage: open: priority   = 10
[dcbz:02880] sstore:stage: open: verbosity  = 30
[dcbz:02880] sstore:stage: open: Local snapshot directory = /tmp
[dcbz:02880] sstore:stage: open: Is Global dir. shared= False
[dcbz:02880] sstore:stage: open: Node Local Caching   = Disabled
[dcbz:02880] sstore:stage: open: Compression  = Disabled
[dcbz:02880] sstore:stage: open: Compression Delay= 0
[dcbz:02880] sstore:stage: open: Skip FileM (Debug Only)  = False



On Mon, Mar 03, 2014 at 05:42:13PM +0100, Adrian Reber wrote:
> I will prepare a patch that moves the parameter initialization somewhere else
> and will not remove it. Do you think the other parts of the patch can be
> applied (without sstore_stage_select() removal)?
> 
> 
> On Mon, Mar 03, 2014 at 10:07:36AM -0600, Josh Hursey wrote:
> > It should probably be moved to the component initialization of the sstore
> > stage component since those parameters are how the user controls where to
> > store those files. I think there is an MCA registration function that is
> > called after component initialization - that would be the best spot, but I
> > do not remember how to set it up at the moment.
> > 
> > 
> > 
> > 
> > On Mon, Mar 3, 2014 at 7:25 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> > > I removed a complete function because it was not used:
> > >
> > > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top
> > > level:
> > > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c:77:12:
> > > warning: 'sstore_stage_select' defined but not used [-Wunused-function]
> > >  static int sstore_stage_select (void)
> > >
> > > And grepping through the code it seems the compiler is right.
> > >
> > > Should we keep the code and maybe just #ifdef it out.
> > >
> > > On Mon, Mar 03, 2014 at 07:17:19AM -0600, Josh Hursey wrote:
> > > > It looks like you removed a number of sstore stage MCA parameters. Did
> > > they
> > > > move somewhere else? or do you have a different way to set those
> > > parameters?
> > > >
> > > > Other than that it looks good to me.
> > > >
> > > >
> > > > On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber <adr...@lisas.de> wrote:
> > > >
> > > > > I have a simple patch which fixes the remaining compiler warnings when
> > > > > running with '--with-ft':
> > > > >
> > > > >
> > > > >
> > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f
> > > > >
> > > > > Does anybody see any problems with this patch?
> > > > >
> > > > > Adrian
> > > > > _______
> > > > > devel mailing list
> > > > > de...@open-mpi.org
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Adrian

-- 
Adrian Reber <adr...@lisas.de>http://lisas.de/~adrian/
guru, n.:
A person in T-shirt and sandals who took an elevator ride with
a senior vice-president and is ultimately responsible for the
phone call you are about to receive from your boss.


Re: [OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-03-03 Thread Adrian Reber
On Fri, Feb 21, 2014 at 10:12:54AM -0700, Nathan Hjelm wrote:
> On Fri, Feb 21, 2014 at 05:21:10PM +0100, Adrian Reber wrote:
> > There is a variable in the FT code which is not defined and therefore
> > currently #ifdef'd out.
> > 
> > #if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
> > #ifdef ENABLE_FT_FIXED
> > /* FIXME_FT
> >  *
> >  * the variable mca_base_component_distill_checkpoint_ready
> >  * was removed by commit 8181c8273c486bba59b3dead324939eac1a58b8c 
> > (r28237)
> >  * "Introduce the MCA framework system. This formalizes the interface 
> > frameworks must provide."
> >  *
> >  * */
> > if (mca_base_component_distill_checkpoint_ready) {
> > open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
> > }
> > #endif /* ENABLE_FT_FIXED */
> > #endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */
> > 
> > 
> > The variable 'mca_base_component_distill_checkpoint_ready' used to exist 
> > but was removed
> > with commit 'r28237':
> > 
> > -#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
> > -{
> > -int param_id = -1;
> > -int param_val = 0;
> > -/*
> > - * Extract supported mca parameters for selection contraints
> > - * Supported Options:
> > - *   - mca_base_component_distill_checkpoint_ready = Checkpoint 
> > Ready
> > - */
> > -param_id = mca_base_param_reg_int_name("mca", 
> > "base_component_distill_checkpoint_ready",
> > -   "Distill only those 
> > components that are Checkpoint Ready", 
> > -   false, false,
> > -   0, &param_val);
> > -if( 0 != param_val ) { /* Select Checkpoint Ready */
> > -open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
> > -}
> > -}
> > -#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */
> > 
> > The variable is defined in contrib/amca-param-sets/ft-enable-cr
> > 
> > mca_base_component_distill_checkpoint_ready=1
> > 
> > Looking at the name of other variable I would say it should be called
> > 
> > opal_base_distill_checkpoint_ready
> > 
> > and probably created with mca_base_var_register() or 
> > mca_base_component_var_register().
> > 
> > What would be the best place to create the variable so that it can be used 
> > again in
> > the FT code?
> 
> Some variables are registered in opal/runtime/opal_params.c. That might
> be a good place to add it.

I added it in that file. What do you think of the following patch:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=a9808e2c4bc765963796eb35878a2e238377


Adrian




Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
I will prepare a patch that moves the parameter initialization somewhere else
and will not remove it. Do you think the other parts of the patch can be
applied (without sstore_stage_select() removal)?


On Mon, Mar 03, 2014 at 10:07:36AM -0600, Josh Hursey wrote:
> It should probably be moved to the component initialization of the sstore
> stage component since those parameters are how the user controls where to
> store those files. I think there is an MCA registration function that is
> called after component initialization - that would be the best spot, but I
> do not remember how to set it up at the moment.
> 
> 
> 
> 
> On Mon, Mar 3, 2014 at 7:25 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I removed a complete function because it was not used:
> >
> > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top
> > level:
> > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c:77:12:
> > warning: 'sstore_stage_select' defined but not used [-Wunused-function]
> >  static int sstore_stage_select (void)
> >
> > And grepping through the code it seems the compiler is right.
> >
> > Should we keep the code and maybe just #ifdef it out.
> >
> > On Mon, Mar 03, 2014 at 07:17:19AM -0600, Josh Hursey wrote:
> > > It looks like you removed a number of sstore stage MCA parameters. Did
> > they
> > > move somewhere else? or do you have a different way to set those
> > parameters?
> > >
> > > Other than that it looks good to me.
> > >
> > >
> > > On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber <adr...@lisas.de> wrote:
> > >
> > > > I have a simple patch which fixes the remaining compiler warnings when
> > > > running with '--with-ft':
> > > >
> > > >
> > > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f
> > > >
> > > > Does anybody see any problems with this patch?
> > > >
> > > > Adrian
> > > > ___
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
I removed a complete function because it was not used:

../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top level:
../../../../../orte/mca/sstore/stage/sstore_stage_component.c:77:12: warning: 
'sstore_stage_select' defined but not used [-Wunused-function]
 static int sstore_stage_select (void)

And grepping through the code it seems the compiler is right.

Should we keep the code and maybe just #ifdef it out?

On Mon, Mar 03, 2014 at 07:17:19AM -0600, Josh Hursey wrote:
> It looks like you removed a number of sstore stage MCA parameters. Did they
> move somewhere else? or do you have a different way to set those parameters?
> 
> Other than that it looks good to me.
> 
> 
> On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I have a simple patch which fixes the remaining compiler warnings when
> > running with '--with-ft':
> >
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f
> >
> > Does anybody see any problems with this patch?
> >
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
I have a simple patch which fixes the remaining compiler warnings when
running with '--with-ft':

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f

Does anybody see any problems with this patch?

Adrian


[OMPI devel] openmpi-1.7.5a1r30797 fails building on SL 5.5

2014-02-22 Thread Adrian Reber
On a Scientific Linux 5.5 system the nightly snapshot
openmpi-1.7.5a1r30797 fails to build with the following errors:


Making all in romio
make[3]: Entering directory 
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio'
make[4]: Entering directory 
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio'
make[4]: Leaving directory 
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio'
make[3]: Leaving directory 
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio'
make[3]: Entering directory 
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio'
  CCLD mca_io_romio.la
romio/.libs/libromio_dist.a(delete.o): In function `lstat64':
delete.c:(.text+0x0): multiple definition of `lstat64'
romio/.libs/libromio_dist.a(close.o):close.c:(.text+0x0): first defined here
romio/.libs/libromio_dist.a(fsync.o): In function `lstat64':
fsync.c:(.text+0x0): multiple definition of `lstat64'
romio/.libs/libromio_dist.a(close.o):close.c:(.text+0x0): first defined here
romio/.libs/libromio_dist.a(get_amode.o): In function `lstat64':
get_amode.c:(.text+0x0): multiple definition of `lstat64'
romio/.libs/libromio_dist.a(close.o):close.c:(.text+0x0): first defined here
romio/.libs/libromio_dist.a(get_atom.o): In function `lstat64':
get_atom.c:(.text+0x0): multiple definition of `lstat64'

and many more of those errors. 1.7.4 also fails.

The following can be seen during configure (with no parameters):

WARNING: Unknown architecture ... proceeding anyway

Adrian


[OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-02-21 Thread Adrian Reber
There is a variable in the FT code which is not defined and therefore
currently #ifdef'd out.

#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
#ifdef ENABLE_FT_FIXED
/* FIXME_FT
 *
 * the variable mca_base_component_distill_checkpoint_ready
 * was removed by commit 8181c8273c486bba59b3dead324939eac1a58b8c (r28237)
 * "Introduce the MCA framework system. This formalizes the interface 
frameworks must provide."
 *
 * */
if (mca_base_component_distill_checkpoint_ready) {
open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
}
#endif /* ENABLE_FT_FIXED */
#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */


The variable 'mca_base_component_distill_checkpoint_ready' used to exist but 
was removed
with commit 'r28237':

-#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
-{
-int param_id = -1;
-int param_val = 0;
-/*
- * Extract supported mca parameters for selection contraints
- * Supported Options:
- *   - mca_base_component_distill_checkpoint_ready = Checkpoint Ready
- */
-param_id = mca_base_param_reg_int_name("mca", 
"base_component_distill_checkpoint_ready",
-   "Distill only those components 
that are Checkpoint Ready", 
-   false, false,
-   0, &param_val);
-if( 0 != param_val ) { /* Select Checkpoint Ready */
-open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
-}
-}
-#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */

The variable is defined in contrib/amca-param-sets/ft-enable-cr

mca_base_component_distill_checkpoint_ready=1

Looking at the names of other variables I would say it should be called

opal_base_distill_checkpoint_ready

and probably created with mca_base_var_register() or 
mca_base_component_var_register().

What would be the best place to create the variable so that it can be used 
again in
the FT code?
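
For what it's worth, a registration along those lines might look roughly like
this (a sketch only, not a committed patch; the exact project/framework split
that yields the name 'opal_base_distill_checkpoint_ready' as well as the info
level and scope are assumptions):

    bool opal_base_distill_checkpoint_ready = false;

    static int register_distill_param(void)
    {
        int idx;

        idx = mca_base_var_register("opal", "base", NULL, "distill_checkpoint_ready",
                                    "Distill only those components that are Checkpoint Ready",
                                    MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                    OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
                                    &opal_base_distill_checkpoint_ready);

        return (0 > idx) ? idx : OPAL_SUCCESS;
    }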

Adrian


Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> So when a process is restarted with CRIU, does it resume execution after
> the criu_dump() or somewhere else?

The process is resumed at the same point it was checkpointed with
criu_dump().

> In a continue/leave-running mode after checkpoint the MPI library does not
> need to do quite a much work since we can depend on some things not
> changing (such as the machine name, orted pid, ...).

During criu_dump() nothing changes.

> In a restart mode then the entire library has to be updated - much more
> expensive than the continue mode.

Ah. If I understand you correctly there are C/R methods which require
that the checkpointed process is terminated and needs to be restarted to
continue running. CRIU is completely transparent for the process. It
needs no special environment (LD_PRELOAD) nor any special handling.
criu_dump() pauses the process, checkpoints it and (if desired) lets it
continue in the same state it was before.

> The CRS components that we have supported emerge from their checkpointing
> function (criu_dump in your case) knowing if they are in the continue or
> restart mode. So that CRS function sets the flag according so the rest of
> the library can do the right thing afterwards.

So, I would say CRIU CRS is in continue mode after criu_dump().
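
In code, that presumably maps to something like the following inside the CRS
checkpoint call (a sketch only, assuming the result is handed back through an
opal_crs_state_type_t out-parameter as the other CRS components do):

    ret = criu_dump();
    if (ret < 0) {
        /* the dump itself failed */
        *state = OPAL_CRS_ERROR;
    } else {
        /* with leave-running the original process simply falls through
         * here, so report "continue"; a copy woken up from the image
         * would have to report OPAL_CRS_RESTART instead */
        *state = OPAL_CRS_CONTINUE;
    }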

> The restart function is called by the opal_restart tool to restart the
> process from an image. Some checkpointers have a library call to restart a
> process others used external tools to do so. So that interface just let's
> the checkpointer decide, given a snapshot image, how it should restart that
> process. The restarted process is assumed to wake up in the
> opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> function name can be a bit misleading.
> 
> Does that help?

That helps a lot. Thanks. I am not 100% sure I understand the restart
case, but I will try to implement it and probably then I will understand
how it works.

Would you say that, for the checkpoint-only functionality in continue
mode, the patch can be checked in?

    Adrian

> On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I think I do not understand your question. So far I have only implemented
> > the
> > checkpoint part and not the restart part.
> >
> > Using criu_dump() the process can  be left in three different
> > states. Without any special handling the process is dumped and then
> > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > or running (--leave-running). I decided to default to --leave-running so
> > that after the checkpoint has been performed the process continues
> > running where it stopped.
> >
> > What would be the difference between 'being restarted versus continuing
> > after checkpointing'? Right now only 'continuing after checkpoint' is
> > implemented. I do not understand how process 'is being restarted' fits
> > in the checkpoint function.
> >
> > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > checkpoint the process and the plan is to use criu_restore() in
> > opal_crs_criu_restart() (which I have not yet implemented).
> >
> > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > It look fine except that the restart state is not flagged. When a process
> > > is restarted does it resume execution inside the criu_dump() function? If
> > > so, is there a way to tell from its return code (or some other mechanism)
> > > that it is being restarted versus continuing after checkpointing?
> > >
> > >
> > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > >
> > > > Great - looks fine to me!!
> > > >
> > > >
> > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber <adr...@lisas.de> wrote:
> > > >
> > > > > I have prepared a patch I would like to commit which adds to code to
> > > > > actually checkpoint a process. Thanks for the pointers about the
> > string
> > > > > variables I tried to do implement it correctly.
> > > > >
> > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > >
> > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > resolved.
> > > > >
> > > > > The patch is at:
> > > > >
> > > > >
> > > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > >
> > > > >   Adrian
> > > > > ___
> > > > > devel mailing list
> > > > > de...@open-mpi.org
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >
> > > > ___
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote:
> On Feb 18, 2014, at 6:24 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> >> On Feb 13, 2014, at 11:26 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>> I tried to implement something like you described. It is not yet event
> >>> driven, but before continuing I wanted to get some feedback if it is at
> >>> least the right start:
> >>> 
> >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> >>> 
> >>> I looked at the other ORTE_OOB_* macros and tried to model my
> >>> functionality a bit after what I have seen there. Right now it is still
> >>> a simple function which just tries to call ft_event() on all oob
> >>> components. Does this look right so far?
> >> 
> >> Sorry for delay - yes, that looks like the right direction. I would 
> >> suggest doing it via the current state machine, though, by simply defining 
> >> another job or proc state in orte/mca/plm/plm_types.h, and then 
> >> registering a callback function using the 
> >> orte_state.add_job[proc]_state(state, function to be called, 
> >> ORTE_ERR_PRI). Then you can activate it by calling 
> >> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> >> proper order.
> > 
> > What is a job/proc in the Open MPI context.
> 
> A "job" is the entire application, while a "proc" is just one process in that 
> application. In this case you could use either one as you are checkpointing 
> the entire job, but all this activity is occurring inside each proc. So I'd 
> suggest defining it as a proc state since it only really involves local 
> actions.
> 
> If you like, I can define the required code in the trunk and let you fill in 
> the event functionality.

That would be great.

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> On Feb 13, 2014, at 11:26 AM, Adrian Reber <adr...@lisas.de> wrote:
> > I tried to implement something like you described. It is not yet event
> > driven, but before continuing I wanted to get some feedback if it is at
> > least the right start:
> > 
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> > 
> > I looked at the other ORTE_OOB_* macros and tried to model my
> > functionality a bit after what I have seen there. Right now it is still
> > a simple function which just tries to call ft_event() on all oob
> > components. Does this look right so far?
> 
> Sorry for delay - yes, that looks like the right direction. I would suggest 
> doing it via the current state machine, though, by simply defining another 
> job or proc state in orte/mca/plm/plm_types.h, and then registering a 
> callback function using the orte_state.add_job[proc]_state(state, function to 
> be called, ORTE_ERR_PRI). Then you can activate it by calling 
> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> proper order.

What is a job/proc in the Open MPI context.

Adrian


Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Adrian Reber
I should have read this email before answering the other.

So opal_crs.checkpoint() is used to checkpoint the process as well as
restart the process? I would have expected opal_crs.restart() is used
for restart. I am confused. Looking at CRS/BLCR checkpoint() seems to
only checkpoint and restart() seems to only restart. The comment in
opal/mca/crs/crs.h says the same as you say.


On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
> These values indicate the current state of the checkpointing lifecycle. In
> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
> others are used by the INC mechanism). In the opal_crs.checkpoint() call
> the checkpointer will capture the program state and it is possible to
> emerge from this function in one of two scenarios. Either we are continuing
> execution in the original process (Continue state), or we are resuming
> execution from a checkpointed state (Restart state).
> 
> So if the checkpoint was successful, and you are not restarting the process
> then you want OPAL_CRS_CONTINUE.
> 
> If the process is being restarted from a checkpoint file, then we should
> emerge from this function setting the state to OPAL_CRS_RESTART.
> 
> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all of
> the components to prepare for checkpoint (we probably should have called it
> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
> You can see it used in the opal_cr_inc_core_prep() function in
> opal/runtime/opal_cr.c
> 
> -- Josh
> 
> 
> 
> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >
> > They are probably used to communicate the state of the CRS modules.
> > OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> > CRS module supposed to set this to if the checkpoint was successful.
> >
> > OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
> >
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
I think I do not understand your question. So far I have only implemented the
checkpoint part and not the restart part.

Using criu_dump() the process can  be left in three different
states. Without any special handling the process is dumped and then
killed. I can also tell criu to leave the process stopped (--leave-stopped)
or running (--leave-running). I decided to default to --leave-running so
that after the checkpoint has been performed the process continues
running where it stopped.

What would be the difference between 'being restarted versus continuing
after checkpointing'? Right now only 'continuing after checkpoint' is
implemented. I do not understand how process 'is being restarted' fits
in the checkpoint function.

In opal_crs_criu_checkpoint() I am using criu_dump() to
checkpoint the process and the plan is to use criu_restore() in
opal_crs_criu_restart() (which I have not yet implemented).
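
For reference, the libcriu calls involved look roughly like this (a standalone
sketch, not the actual opal_crs_criu_checkpoint() code; the option setters and
the header location can differ between CRIU versions):

    #include <fcntl.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <criu/criu.h>              /* libcriu; header location may vary */

    /* Dump the process 'pid' into 'images_dir', optionally leaving it running. */
    static int dump_one_process(pid_t pid, const char *images_dir, bool leave_running)
    {
        int fd, ret;

        if (criu_init_opts() < 0) {
            return -1;
        }

        fd = open(images_dir, O_DIRECTORY);
        if (fd < 0) {
            return -1;
        }

        criu_set_images_dir_fd(fd);
        criu_set_pid(pid);
        criu_set_leave_running(leave_running);  /* the --leave-running behaviour */
        criu_set_log_file("criu-dump.log");
        criu_set_log_level(4);

        ret = criu_dump();                      /* negative return means the dump failed */

        close(fd);
        return ret;
    }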

On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> It look fine except that the restart state is not flagged. When a process
> is restarted does it resume execution inside the criu_dump() function? If
> so, is there a way to tell from its return code (or some other mechanism)
> that it is being restarted versus continuing after checkpointing?
> 
> 
> On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > Great - looks fine to me!!
> >
> >
> > On Feb 17, 2014, at 11:39 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > I have prepared a patch I would like to commit which adds to code to
> > > actually checkpoint a process. Thanks for the pointers about the string
> > > variables I tried to do implement it correctly.
> > >
> > > CRIU currently has problems with the new OOB usock but I will contact
> > > the CRIU developers about this error. Using tcp, checkpointing works.
> > >
> > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > resolved.
> > >
> > > The patch is at:
> > >
> > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > >
> > >   Adrian
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Adrian Reber
I have prepared a patch I would like to commit which adds code to
actually checkpoint a process. Thanks for the pointers about the string
variables; I tried to implement it correctly.

CRIU currently has problems with the new OOB usock but I will contact
the CRIU developers about this error. Using tcp, checkpointing works.

CRIU also has problems with --np > 1, but I am sure this can also be
resolved.

The patch is at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492

Adrian


[OMPI devel] How to prefer oob/tcp over oob/usock

2014-02-17 Thread Adrian Reber
With the newly added oob/usock component, checkpointing with CRIU stopped working.
Is there a way I can prefer oob/tcp on the command line?

Adrian


[OMPI devel] OPAL_CRS_* meaning

2014-02-17 Thread Adrian Reber
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?

They are probably used to communicate the state of the CRS modules.
OPAL_CRS_ERROR seems to be used in case an error happened. What is the
CRS module supposed to set this to if the checkpoint was successful?

OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?

Adrian


Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
Thanks. That almost works. I need this additional change

   [check_crs_criu_good=yes])

 # If we do not want CRIU, then do not compile this component
-AS_IF([test "$with_criu" = "no"],
+AS_IF([test "$with_criu" = "no" || test $check_crs_criu_good = no],
   [check_crs_criu_good=no],
   [check_crs_criu_good=yes])

I will commit your patch with this additional change.

On Fri, Feb 14, 2014 at 04:59:50PM +, Jeff Squyres (jsquyres) wrote:
> Check out this patch:
> 
> 
> https://github.com/jsquyres/fork-from-adrian-ft/commit/f5962184f3ea6dffc182a18f7603c5e70e82ac99
> 
> 
> 
> On Feb 14, 2014, at 11:35 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> wrote:
> 
> > Perfect; cloning now.  Thanks!
> > 
> > On Feb 14, 2014, at 11:34 AM, Adrian Reber <adr...@lisas.de>
> > wrote:
> > 
> >> Sure. I added the cloneurl information:
> >> 
> >> https://lisas.de/~adrian/open-mpi.git
> >> 
> >> On Fri, Feb 14, 2014 at 04:30:05PM +, Jeff Squyres (jsquyres) wrote:
> >>> Can I clone your git tree and send you a patch?
> >>> 
> >>> On Feb 11, 2014, at 4:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>>> On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote:
> >>>>> On Feb 8, 2014, at 4:49 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>> 
> >>>>>>> I note you have a stray $3 at the end of your configure.m4, too (it 
> >>>>>>> might supposed to be $2?).
> >>>>>> 
> >>>>>> I think I do not really understand configure.m4 and was happy to just
> >>>>>> copy it from blcr. Especially what $2 and $3 mean and how they are
> >>>>>> supposed to be used. I will try to simplify my configure.m4. Is there 
> >>>>>> an
> >>>>>> example which I can have a look at?
> >>>>> 
> >>>>> Sorry -- been a bit busy with releasing OMPI 1.7.4 and preparing for 
> >>>>> 1.7.5...
> >>>>> 
> >>>>> m4 is a macro language, so think of it as templates with some 
> >>>>> intelligence.  
> >>>>> 
> >>>>> $1, $2, and $3 are the "parameters" passed in to the macro.  So when 
> >>>>> you do something like:
> >>>>> 
> >>>>> AC_DEFUN([FOO], [
> >>>>> echo 1 is $1
> >>>>> echo 2 is $2])
> >>>>> 
> >>>>> and you invoke that macro via
> >>>>> 
> >>>>> FOO([hello world], [goodbye world])
> >>>>> 
> >>>>> the generated script will contain:
> >>>>> 
> >>>>> echo 1 is hello world
> >>>>> echo 2 is goodbye world
> >>>>> 
> >>>>> In our case, $1 is the action to execute if the package is happy / 
> >>>>> wants to build, and $2 is the action to execute if the package is 
> >>>>> unhappy / does not want to build.
> >>>>> 
> >>>>> Meaning: we have a top-level engine that is iterating over all 
> >>>>> frameworks and components, and calling their *_CONFIG macros with 
> >>>>> appropriate $1 and $2 values that expand to actions-to-execute-if-happy 
> >>>>> / actions-to-execute-if-unhappy.
> >>>>> 
> >>>>> Make sense?
> >>>> 
> >>>> Thanks. I also tried to understand the macros better and with the
> >>>> generated output and your description I think I understand it.
> >>>> 
> >>>> Trying to simplify configure.m4 like you suggested I would change this:
> >>>> 
> >>>>  AS_IF([test "$check_crs_criu_good" != "yes"], [$2],
> >>>>[AS_IF([test ! -z "$with_criu" -a "$with_criu" != "yes"],
> >>>>   [check_crs_criu_dir="$with_criu"
> >>>>check_crs_criu_dir_msg="$with_criu (from --with-criu)"])
> >>>> AS_IF([test ! -z "$with_criu_libdir" -a "$with_criu_libdir" != 
> >>>> "yes"],
> >>>>   [check_crs_criu_libdir="$with_criu_libdir"
> >>>>check_crs_criu_libdir_msg="$with_criu_libdir (from 
>

Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
Sure. I added the cloneurl information:

https://lisas.de/~adrian/open-mpi.git

On Fri, Feb 14, 2014 at 04:30:05PM +, Jeff Squyres (jsquyres) wrote:
> Can I clone your git tree and send you a patch?
> 
> On Feb 11, 2014, at 4:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote:
> >> On Feb 8, 2014, at 4:49 PM, Adrian Reber <adr...@lisas.de> wrote:
> >> 
> >>>> I note you have a stray $3 at the end of your configure.m4, too (it 
> >>>> might supposed to be $2?).
> >>> 
> >>> I think I do not really understand configure.m4 and was happy to just
> >>> copy it from blcr. Especially what $2 and $3 mean and how they are
> >>> supposed to be used. I will try to simplify my configure.m4. Is there an
> >>> example which I can have a look at?
> >> 
> >> Sorry -- been a bit busy with releasing OMPI 1.7.4 and preparing for 
> >> 1.7.5...
> >> 
> >> m4 is a macro language, so think of it as templates with some 
> >> intelligence.  
> >> 
> >> $1, $2, and $3 are the "parameters" passed in to the macro.  So when you 
> >> do something like:
> >> 
> >> AC_DEFUN([FOO], [
> >>   echo 1 is $1
> >>   echo 2 is $2])
> >> 
> >> and you invoke that macro via
> >> 
> >>   FOO([hello world], [goodbye world])
> >> 
> >> the generated script will contain:
> >> 
> >>   echo 1 is hello world
> >>   echo 2 is goodbye world
> >> 
> >> In our case, $1 is the action to execute if the package is happy / wants 
> >> to build, and $2 is the action to execute if the package is unhappy / does 
> >> not want to build.
> >> 
> >> Meaning: we have a top-level engine that is iterating over all frameworks 
> >> and components, and calling their *_CONFIG macros with appropriate $1 and 
> >> $2 values that expand to actions-to-execute-if-happy / 
> >> actions-to-execute-if-unhappy.
> >> 
> >> Make sense?
> > 
> > Thanks. I also tried to understand the macros better and with the
> > generated output and your description I think I understand it.
> > 
> > Trying to simplify configure.m4 like you suggested I would change this:
> > 
> >AS_IF([test "$check_crs_criu_good" != "yes"], [$2],
> >  [AS_IF([test ! -z "$with_criu" -a "$with_criu" != "yes"],
> > [check_crs_criu_dir="$with_criu"
> >  check_crs_criu_dir_msg="$with_criu (from --with-criu)"])
> >   AS_IF([test ! -z "$with_criu_libdir" -a "$with_criu_libdir" != 
> > "yes"],
> > [check_crs_criu_libdir="$with_criu_libdir"
> >  check_crs_criu_libdir_msg="$with_criu_libdir (from 
> > --with-criu-libdir)"])
> >  ])
> > 
> > to this:
> > 
> >   AS_IF([test "$check_crs_criu_good" = "yes" -a ! -z "$with_criu" -a 
> > "$with_criu" != "yes"],
> > [check_crs_criu_dir="$with_criu"
> >  check_crs_criu_dir_msg="$with_criu (from --with-criu)"], 
> > [$2
> >  check_crs_criu_good="no"])
> > 
> >   AS_IF([test "$check_crs_criu_good" = "yes" -a ! -z "$with_criu_libdir" -a 
> > "$with_criu_libdir" != "yes"],
> > [check_crs_criu_dir_libdir="$with_criu_libdir"
> >  check_crs_criu_dir_libdir_msg="$with_criu_libdir (from 
> > --with-criu)"],
> > [$2
> >  check_crs_criu_good="no"])
> > 
> > 
> > correct? With three checks in one line it seems bit unreadable
> > and the nested AS_IF seems easier for me to understand.
> > Did I understand it correctly what you meant or did you
> > mean something else?
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] mca_base_component_var_register() MCA_BASE_VAR_TYPE_STRING

2014-02-14 Thread Adrian Reber
I am trying to find out how to deal with string variables. Do I have to
allocate the memory before calling mca_base_component_var_register() or
not? It seems it does a strdup(), meaning it has to be free()'d when
closing the component. Looking at other occurrences of string variables
I see different usages: sometimes the storage is set to NULL, sometimes not. Before
following a wrong example maybe someone can tell me what the correct
usage is.

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-02-13 Thread Adrian Reber
On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Josh explained it to me a few days ago, that after a checkpoint has been
> > received TCP should no longer be used to not lose any messages. The
> > communication happens over named pipes and therefore (I think) OOB
> > ft_event() is used to quiet everything besides the pipes. This all seems
> > to work but I was just confused as the functions for ft_event()
> > in oob/tcp and oob/ud do not seem to contain any functionality.
> > 
> > So do I try to fix the ft_event() function in oob/base/ to call the
> > registered ft_event() function which does nothing or do I just remove
> > the call to orte oob ft_event().
> 
> Sounds like you'll need to tell the OOB components to stop processing 
> messages, so that will require that you insert an event into the system. You 
> have to account for two things:
> 
> (a) the OOB base and OOB components are operating on the orte_event_base, but
> 
> (b) each OOB component can have multiple active modules (one per NIC) that 
> are operating on their own event base/thread.
> 
> So you have to start by pushing an event that calls the OOB base, which then 
> loops across the components calling their ft_event interface. Each component 
> would then have to create an event for each active module, inserting that 
> event into the module's event base/thread. When activated, each module would 
> have to shutdown its message engine, and activate another event to notify its 
> component that all is quiet.
> 
> Once a component finds out that all its modules are quiet, it would then have 
> to activate an event to the OOB base. Once the OOB base sees all components 
> report quiet, then it would have to activate an event to take you to the next 
> step in your process.
> 
> In other words, you need to turn the quieting process into its own set of 
> states and run it through the state machine. This is the only way to 
> guarantee that you'll keep things orderly, and is the major change needed in 
> the C/R procedure as it flows thru ORTE. You can't just progress thru a set 
> of function calls as you'll inevitably run into a roadblock requiring that 
> you wait for an event-driven process to complete.

I tried to implement something like you described. It is not yet
event-driven, but before continuing I wanted to get some feedback on whether it is at
least the right start:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706

I looked at the other ORTE_OOB_* macros and tried to model my
functionality a bit after what I have seen there. Right now it is still
a simple function which just tries to call ft_event() on all oob
components. Does this look right so far?

Adrian
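
For what it is worth, pushing such a "quiesce" request into the ORTE event base could look roughly like the sketch below, modeled on how other ORTE state-activation code works. The caddy struct, function names and priority choice are illustrative assumptions, not the actual implementation:

    typedef struct {
        opal_event_t ev;
        int state;
    } oob_ft_caddy_t;

    static void oob_ft_event_cb(int fd, short args, void *cbdata)
    {
        oob_ft_caddy_t *caddy = (oob_ft_caddy_t*)cbdata;
        /* Runs in the orte_event_base thread: iterate over the active OOB
         * components here, ask each one to quiesce, and let them activate
         * a follow-up event once they report that they are quiet. */
        free(caddy);
    }

    static void oob_ft_event_post(int state)
    {
        oob_ft_caddy_t *caddy = (oob_ft_caddy_t*)malloc(sizeof(oob_ft_caddy_t));
        caddy->state = state;
        opal_event_set(orte_event_base, &caddy->ev, -1, OPAL_EV_WRITE,
                       oob_ft_event_cb, caddy);
        opal_event_set_priority(&caddy->ev, ORTE_MSG_PRI);
        opal_event_active(&caddy->ev, OPAL_EV_WRITE, 1);
    }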


Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
It seems this is indeed a Moab bug for interactive jobs; at least a bug
was opened against Moab. With non-interactive jobs the variables have
the correct values, and mpirun has no problem detecting the correct
number of cores.

On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote:
> Another possibility to check - it is entirely possible that Moab is 
> miscommunicating the values to Slurm. You might need to check it - I'll 
> install a copy of 2.6.5 on my machines and see if I get similar issues when 
> Slurm does the allocation itself.
> 
> On Feb 12, 2014, at 7:47 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > 
> > On Feb 12, 2014, at 7:32 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> >> 
> >> $ msub -I -l nodes=3:ppn=8
> >> salloc: Job is in held state, pending scheduler release
> >> salloc: Pending job allocation 131828
> >> salloc: job 131828 queued and waiting for resources
> >> salloc: job 131828 has been allocated resources
> >> salloc: Granted job allocation 131828
> >> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> >> 1
> >> sh-4.1$ rpm -q slurm
> >> slurm-2.6.5-1.el6.x86_64
> >> sh-4.1$ echo $SLURM_NNODES 
> >> 1
> >> sh-4.1$ echo $SLURM_JOB_NODELIST 
> >> [107-108,176]
> >> sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
> >> 8(x3)
> >> sh-4.1$ echo $SLURM_NODELIST 
> >> [107-108,176]
> >> sh-4.1$ echo $SLURM_NPROCS  
> >> 1
> >> sh-4.1$ echo $SLURM_NTASKS 
> >> 1
> >> sh-4.1$ echo $SLURM_TASKS_PER_NODE 
> >> 1
> >> 
> >> The information in *_NODELIST seems to make sense, but all the other
> >> variables (PROCS, TASKS, NODES) report '1', which seems wrong.
> > 
> > Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, 
> > and my guess is that SchedMD once again has changed the @$!#%#@ meaning of 
> > their envars. Frankly, it is nearly impossible to track all the variants 
> > they have created over the years.
> > 
> > Please check to see if someone did a little customizing on your end as 
> > sometimes people do that to Slurm. Could also be they did something in the 
> > Slurm config file that is causing the changed behavior.
> > 
> > Meantime, I'll try to ponder a potential solution in case this really is 
> > the "latest" Slurm screwup.
> > 
> > 
> >> 
> >> 
> >> On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> >>> ...and your version of Slurm?
> >>> 
> >>> On Feb 12, 2014, at 7:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> 
> >>>> What is your SLURM_TASKS_PER_NODE?
> >>>> 
> >>>> On Feb 12, 2014, at 6:58 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>>> 
> >>>>> No, the system has only a few MOAB_* variables and many SLURM_*
> >>>>> variables:
> >>>>> 
> >>>>> $BASH $IFS  $SECONDS
> >>>>>   $SLURM_PTY_PORT
> >>>>> $BASHOPTS $LINENO   $SHELL  
> >>>>>   $SLURM_PTY_WIN_COL
> >>>>> $BASHPID  $LINES$SHELLOPTS  
> >>>>>   $SLURM_PTY_WIN_ROW
> >>>>> $BASH_ALIASES $MACHTYPE $SHLVL  
> >>>>>   $SLURM_SRUN_COMM_HOST
> >>>>> $BASH_ARGC$MAILCHECK
> >>>>> $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> >>>>> $BASH_ARGV$MOAB_CLASS   
> >>>>> $SLURM_CHECKPOINT_IMAGE_DIR   $SLURM_STEPID
> >>>>> $BASH_CMDS$MOAB_GROUP   $SLURM_CONF 
> >>>>>   $SLURM_STEP_ID
> >>>>> $BASH_COMMAND $MOAB_JOBID   
> >>>>> $SLURM_CPUS_ON_NODE   $SLURM_STEP_LAUNCHER_PORT
> >>>>> $BASH_LINENO  $MOAB_NODECOUNT   
> >>>>> $SLURM_DISTRIBUTION   $SLURM_STEP_NODELIST
> >>>>> $BASH_SOURCE  $MOAB_PARTITION   
> >>>>> $SLURM_GTIDS  $SLURM_STEP_NUM_NODES
> >>>>> $BASH_SUBSHELL$MOAB_PROCCOUNT   
&

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber

$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES 
1
sh-4.1$ echo $SLURM_JOB_NODELIST 
[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)
sh-4.1$ echo $SLURM_NODELIST 
[107-108,176]
sh-4.1$ echo $SLURM_NPROCS  
1
sh-4.1$ echo $SLURM_NTASKS 
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1

The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.


On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
> ...and your version of Slurm?
> 
> On Feb 12, 2014, at 7:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > What is your SLURM_TASKS_PER_NODE?
> > 
> > On Feb 12, 2014, at 6:58 AM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> >> No, the system has only a few MOAB_* variables and many SLURM_*
> >> variables:
> >> 
> >> $BASH $IFS  $SECONDS   
> >>$SLURM_PTY_PORT
> >> $BASHOPTS $LINENO   $SHELL 
> >>$SLURM_PTY_WIN_COL
> >> $BASHPID  $LINES$SHELLOPTS 
> >>$SLURM_PTY_WIN_ROW
> >> $BASH_ALIASES $MACHTYPE $SHLVL 
> >>$SLURM_SRUN_COMM_HOST
> >> $BASH_ARGC$MAILCHECK
> >> $SLURMD_NODENAME  $SLURM_SRUN_COMM_PORT
> >> $BASH_ARGV$MOAB_CLASS   
> >> $SLURM_CHECKPOINT_IMAGE_DIR   $SLURM_STEPID
> >> $BASH_CMDS$MOAB_GROUP   $SLURM_CONF
> >>$SLURM_STEP_ID
> >> $BASH_COMMAND $MOAB_JOBID   
> >> $SLURM_CPUS_ON_NODE   $SLURM_STEP_LAUNCHER_PORT
> >> $BASH_LINENO  $MOAB_NODECOUNT   
> >> $SLURM_DISTRIBUTION   $SLURM_STEP_NODELIST
> >> $BASH_SOURCE  $MOAB_PARTITION   $SLURM_GTIDS   
> >>$SLURM_STEP_NUM_NODES
> >> $BASH_SUBSHELL$MOAB_PROCCOUNT   $SLURM_JOBID   
> >>$SLURM_STEP_NUM_TASKS
> >> $BASH_VERSINFO$MOAB_SUBMITDIR   
> >> $SLURM_JOB_CPUS_PER_NODE  $SLURM_STEP_TASKS_PER_NODE
> >> $BASH_VERSION $MOAB_USER$SLURM_JOB_ID  
> >>$SLURM_SUBMIT_DIR
> >> $COLUMNS  $OPTERR   
> >> $SLURM_JOB_NODELIST   $SLURM_SUBMIT_HOST
> >> $COMP_WORDBREAKS  $OPTIND   
> >> $SLURM_JOB_NUM_NODES  $SLURM_TASKS_PER_NODE
> >> $DIRSTACK $OSTYPE   
> >> $SLURM_LAUNCH_NODE_IPADDR $SLURM_TASK_PID
> >> $EUID $PATH $SLURM_LOCALID 
> >>$SLURM_TOPOLOGY_ADDR
> >> $GROUPS   $POSIXLY_CORRECT  $SLURM_NNODES  
> >>$SLURM_TOPOLOGY_ADDR_PATTERN
> >> $HISTCMD  $PPID $SLURM_NODEID  
> >>$SRUN_DEBUG
> >> $HISTFILE $PS1  
> >> $SLURM_NODELIST   $TERM
> >> $HISTFILESIZE $PS2  $SLURM_NPROCS  
> >>$TMPDIR
> >> $HISTSIZE $PS4  $SLURM_NTASKS  
> >>$UID
> >> $HOSTNAME $PWD  
> >> $SLURM_PRIO_PROCESS   $_
> >> $HOSTTYPE $RANDOM   $SLURM_PROCID  
> >>
> >> 
> >> 
> >> 
> >> On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> >>> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
> >>> SLURM envars at all. What you should see are PBS_* envars, including a 
> >>> PBS_NODEFILE that actually contains the allocation.
> >>> 
> >>> 
> >>> On Feb 12, 2014, at 4:42 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>&

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
No, the system has only a few MOAB_* variables and many SLURM_*
variables:

$BASH $BASHOPTS $BASHPID $BASH_ALIASES $BASH_ARGC $BASH_ARGV $BASH_CMDS
$BASH_COMMAND $BASH_LINENO $BASH_SOURCE $BASH_SUBSHELL $BASH_VERSINFO
$BASH_VERSION $COLUMNS $COMP_WORDBREAKS $DIRSTACK $EUID $GROUPS $HISTCMD
$HISTFILE $HISTFILESIZE $HISTSIZE $HOSTNAME $HOSTTYPE $IFS $LINENO $LINES
$MACHTYPE $MAILCHECK $OPTERR $OPTIND $OSTYPE $PATH $POSIXLY_CORRECT $PPID
$PS1 $PS2 $PS4 $PWD $RANDOM $SECONDS $SHELL $SHELLOPTS $SHLVL $TERM
$TMPDIR $UID $_

$MOAB_CLASS $MOAB_GROUP $MOAB_JOBID $MOAB_NODECOUNT $MOAB_PARTITION
$MOAB_PROCCOUNT $MOAB_SUBMITDIR $MOAB_USER

$SLURMD_NODENAME $SLURM_CHECKPOINT_IMAGE_DIR $SLURM_CONF
$SLURM_CPUS_ON_NODE $SLURM_DISTRIBUTION $SLURM_GTIDS $SLURM_JOBID
$SLURM_JOB_CPUS_PER_NODE $SLURM_JOB_ID $SLURM_JOB_NODELIST
$SLURM_JOB_NUM_NODES $SLURM_LAUNCH_NODE_IPADDR $SLURM_LOCALID
$SLURM_NNODES $SLURM_NODEID $SLURM_NODELIST $SLURM_NPROCS $SLURM_NTASKS
$SLURM_PRIO_PROCESS $SLURM_PROCID $SLURM_PTY_PORT $SLURM_PTY_WIN_COL
$SLURM_PTY_WIN_ROW $SLURM_SRUN_COMM_HOST $SLURM_SRUN_COMM_PORT
$SLURM_STEPID $SLURM_STEP_ID $SLURM_STEP_LAUNCHER_PORT
$SLURM_STEP_NODELIST $SLURM_STEP_NUM_NODES $SLURM_STEP_NUM_TASKS
$SLURM_STEP_TASKS_PER_NODE $SLURM_SUBMIT_DIR $SLURM_SUBMIT_HOST
$SLURM_TASKS_PER_NODE $SLURM_TASK_PID $SLURM_TOPOLOGY_ADDR
$SLURM_TOPOLOGY_ADDR_PATTERN $SRUN_DEBUG



On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
> Seems rather odd - since this is managed by Moab, you shouldn't be seeing 
> SLURM envars at all. What you should see are PBS_* envars, including a 
> PBS_NODEFILE that actually contains the allocation.
> 
> 
> On Feb 12, 2014, at 4:42 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
> > with slurm and moab. I requested an interactive session using:
> > 
> > msub -I -l nodes=3:ppn=8
> > 
> > and started a simple test case which fails:
> > 
> > $ mpirun -np 2 ./mpi-test 1
> > --
> > There are not enough slots available in the system to satisfy the 2 slots 
> > that were requested by the application:
> >  ./mpi-test
> > 
> > Either request fewer slots for your application, or make more slots 
> > available
> > for use.
> > --
> > srun: error: 108: task 1: Exited with exit code 1
> > srun: Terminating job step 131823.4
> > srun: error: 107: task 0: Exited with exit code 1
> > srun: Job step aborted
> > slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH 
> > SIGNAL 9 ***
> > 
> > 
> > requesting only one core works:
> > 
> > $ mpirun  ./mpi-test 1
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
> > 
> > 
> > using openmpi-1.6.5 works with multiple cores:
> > 
> > $ mpirun -np 24 ./mpi-test 2
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
> > 4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00
> > 
> > $ echo $SLURM_JOB_CPUS_PER_NODE 
> > 8(x3)
> > 
> > I never used slurm before so this c

[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
with slurm and moab. I requested an interactive session using:

msub -I -l nodes=3:ppn=8

and started a simple test case which fails:

$ mpirun -np 2 ./mpi-test 1
--
There are not enough slots available in the system to satisfy the 2 slots 
that were requested by the application:
  ./mpi-test

Either request fewer slots for your application, or make more slots available
for use.
--
srun: error: 108: task 1: Exited with exit code 1
srun: Terminating job step 131823.4
srun: error: 107: task 0: Exited with exit code 1
srun: Job step aborted
slurmd[108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 
***


requesting only one core works:

$ mpirun  ./mpi-test 1
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 1: 0.00


using openmpi-1.6.5 works with multiple cores:

$ mpirun -np 24 ./mpi-test 2
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on 106 out of 24: 0.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on 106 out of 24: 12.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on 108 out of 24: 11.00
4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on 106 out of 24: 18.00

$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)

I have never used Slurm before, so this could also be a user error on my side.
But as 1.6.5 works it seems something has changed, and I wanted to let
you know in case it was not intentional.

Adrian
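
One note on the numbers above: SLURM_JOB_CPUS_PER_NODE uses a compressed "count(xrepeat)" notation, so "8(x3)" means 8 CPUs on each of 3 nodes, i.e. 24 slots in total. A small, self-contained sketch of expanding that notation (purely illustrative, not the code Open MPI actually uses):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Expand a SLURM "cpus per node" string such as "8(x3),4"
     * into a total slot count: 8*3 + 4 = 28. */
    static int slurm_cpus_total(const char *spec)
    {
        int total = 0;
        char *copy = strdup(spec);
        char *save = NULL;
        for (char *tok = strtok_r(copy, ",", &save); NULL != tok;
             tok = strtok_r(NULL, ",", &save)) {
            int cpus = atoi(tok);            /* leading CPU count */
            int repeat = 1;
            char *x = strstr(tok, "(x");     /* optional repeat factor */
            if (NULL != x) {
                repeat = atoi(x + 2);
            }
            total += cpus * repeat;
        }
        free(copy);
        return total;
    }

    int main(void)
    {
        printf("%d\n", slurm_cpus_total("8(x3)"));   /* prints 24 */
        return 0;
    }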


Re: [OMPI devel] new CRS component added (criu)

2014-02-11 Thread Adrian Reber
On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote:
> On Feb 8, 2014, at 4:49 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> >> I note you have a stray $3 at the end of your configure.m4, too (it might 
> >> supposed to be $2?).
> > 
> > I think I do not really understand configure.m4 and was happy to just
> > copy it from blcr. Especially what $2 and $3 mean and how they are
> > supposed to be used. I will try to simplify my configure.m4. Is there an
> > example which I can have a look at?
> 
> Sorry -- been a bit busy with releasing OMPI 1.7.4 and preparing for 1.7.5...
> 
> m4 is a macro language, so think of it as templates with some intelligence.  
> 
> $1, $2, and $3 are the "parameters" passed in to the macro.  So when you do 
> something like:
> 
> AC_DEFUN([FOO], [
>echo 1 is $1
>echo 2 is $2])
> 
> and you invoke that macro via
> 
>FOO([hello world], [goodbye world])
> 
> the generated script will contain:
> 
>echo 1 is hello world
>echo 2 is goodbye world
> 
> In our case, $1 is the action to execute if the package is happy / wants to 
> build, and $2 is the action to execute if the package is unhappy / does not 
> want to build.
> 
> Meaning: we have a top-level engine that is iterating over all frameworks and 
> components, and calling their *_CONFIG macros with appropriate $1 and $2 
> values that expand to actions-to-execute-if-happy / 
> actions-to-execute-if-unhappy.
> 
> Make sense?

Thanks. I also tried to understand the macros better and with the
generated output and your description I think I understand it.

Trying to simplify configure.m4 like you suggested I would change this:

AS_IF([test "$check_crs_criu_good" != "yes"], [$2],
  [AS_IF([test ! -z "$with_criu" -a "$with_criu" != "yes"],
 [check_crs_criu_dir="$with_criu"
  check_crs_criu_dir_msg="$with_criu (from --with-criu)"])
   AS_IF([test ! -z "$with_criu_libdir" -a "$with_criu_libdir" != 
"yes"],
 [check_crs_criu_libdir="$with_criu_libdir"
  check_crs_criu_libdir_msg="$with_criu_libdir (from 
--with-criu-libdir)"])
  ])

to this:

   AS_IF([test "$check_crs_criu_good" = "yes" -a ! -z "$with_criu" -a 
"$with_criu" != "yes"],
 [check_crs_criu_dir="$with_criu"
  check_crs_criu_dir_msg="$with_criu (from --with-criu)"], 
 [$2
  check_crs_criu_good="no"])

   AS_IF([test "$check_crs_criu_good" = "yes" -a ! -z "$with_criu_libdir" -a 
"$with_criu_libdir" != "yes"],
 [check_crs_criu_dir_libdir="$with_criu_libdir"
  check_crs_criu_dir_libdir_msg="$with_criu_libdir (from --with-criu)"],
 [$2
  check_crs_criu_good="no"])


correct? With three checks on one line it seems a bit unreadable,
and the nested AS_IF seems easier for me to understand.
Did I understand correctly what you meant, or did you
mean something else?

Adrian


Re: [OMPI devel] new CRS component added (criu)

2014-02-08 Thread Adrian Reber
On Fri, Feb 07, 2014 at 10:08:48PM +, Jeff Squyres (jsquyres) wrote:
> Sweet -- +1 for CRIU support!
> 
> FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd 
> actually go with making it a bit simpler.  For example, I typically structure 
> my configure.m4's like this (typed in mail client -- forgive mistakes...):
> 
> -
>AS_IF([...some test], [crs_criu_happy=1], [crs_criu_happy=0])
># Only bother doing the next test if the previous one passed
>AS_IF([test $crs_criu_happy -eq 1 && ...next test], 
>  [crs_criu_happy=1], [crs_criu_happy=0])
># Only bother doing the next test if the previous one passed
>AS_IF([test $crs_criu_happy -eq 1 && ...next test], 
>  [crs_criu_happy=1], [crs_criu_happy=0])
> 
>...etc...
> 
># Put a single execution of $2 and $3 at the end, depending on how the 
># above tests go.  If a human asked for criu (e.g., --with-criu) and
># we can't find criu support, that's a fatal error.
>AS_IF([test $crs_criu_happy -eq 1],
>  [$2],
>  [AS_IF([test "$with_criu" != "x" && "x$with_criu" != "xno"],
> [AC_MSG_WARN([You asked for CRIU support, but I can't find 
> it.])
>  AC_MSG_ERROR([Cannot continue])],
> [$1])
>   ])
> -
> 
> I note you have a stray $3 at the end of your configure.m4, too (it might 
> supposed to be $2?).

I think I do not really understand configure.m4 and was happy to just
copy it from the blcr component; in particular, I am unsure what $2 and $3 mean and how they are
supposed to be used. I will try to simplify my configure.m4. Is there an
example I can have a look at?

> Finally, I note you're looking for libcriu.  Last time I checked with the 
> CRIU guys -- which was quite a while ago -- that didn't exist (but I put in 
> my $0.02 that OMPI would like to see such a userspace library).  I take it 
> that libcriu now exists?

Yes, criu introduced libcriu with the 1.1 release. It is used to make
RPC calls to the criu process running as a service. I submitted a few
patches to criu to actually install the headers and libraries, and
included that in the Fedora package:

https://admin.fedoraproject.org/updates/criu-1.1-4.fc20

This is what I am currently using to build against criu.

Adrian
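
To give an idea of what the RPC interface looks like from the client side, a minimal connectivity check against the criu service might look like the sketch below (the header path and socket path are assumptions based on how the packaged headers are installed and how the service is usually started):

    #include <stdio.h>
    #include <criu/criu.h>

    int main(void)
    {
        criu_init_opts();
        criu_set_service_address("/var/run/criu_service.socket");
        /* criu_check() does a simple request/response round trip
         * with the criu service daemon */
        if (criu_check() < 0) {
            fprintf(stderr, "criu service not reachable\n");
            return 1;
        }
        printf("criu service is alive\n");
        return 0;
    }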


[OMPI devel] new CRS component added (criu)

2014-02-07 Thread Adrian Reber
I have created a new CRS component using criu (criu.org) to support
checkpoint/restart in Open MPI. My current patch only provides the
framework and necessary configure scripts to detect and link against
criu. With this patch orte-checkpoint can request a checkpoint and the
new CRIU CRS component is used:

[dcbz:13766] orte_cr: init: orte_cr_init()
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13771] opal_cr: init: Verbose Level: 30
[dcbz:13771] opal_cr: init: FT Enabled: true
[dcbz:13771] opal_cr: init: Is a tool program: false
[dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
[dcbz:13771] opal_cr: init: Checkpoint Signal: 10
[dcbz:13771] opal_cr: init: FT Use thread: true
[dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
[dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
[dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
[dcbz:13771] opal_cr: init: Temp Directory: /tmp
...
[dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
[dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
[dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13772] crs:criu: checkpoint(13772, ---)
[dcbz:13772] crs:criu: criu_init_opts() returned 0
[dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13771] crs:criu: checkpoint(13771, ---)
[dcbz:13771] crs:criu: criu_init_opts() returned 0
...
[dcbz:13766] 13766: Checkpoint established for process [55729,0].
[dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
[dcbz:13766] 13766: Successfully restarted process [55729,0].
[dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13772] orte_cr: coord: orte_cr_coord(Running)

It seems the C/R code basically works again and now needs to be filled
with the actual code to take checkpoints using criu.

The patch I want to check in is available at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea

The patch only creates files in opal/mca/crs/criu and does not touch any
other code.

Adrian
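
To sketch what that actual checkpoint code could eventually look like inside the CRS checkpoint call, here is a rough libcriu example. The option set, file names and error handling are simplified assumptions, not the final implementation:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <criu/criu.h>

    /* Dump process 'pid' into 'image_dir' and leave it running so the
     * job can continue after the checkpoint has been taken. */
    static int criu_dump_pid(pid_t pid, const char *image_dir)
    {
        int ret;
        int fd = open(image_dir, O_DIRECTORY);
        if (fd < 0) {
            return -1;
        }
        criu_init_opts();
        criu_set_pid(pid);
        criu_set_images_dir_fd(fd);
        criu_set_log_file("criu-dump.log");
        criu_set_log_level(4);
        criu_set_shell_job(true);
        criu_set_leave_running(true);
        ret = criu_dump();          /* 0 on success, negative on error */
        close(fd);
        return ret;
    }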


Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
Josh explained it to me a few days ago: after a checkpoint has been
received, TCP should no longer be used so that no messages are lost. The
communication happens over named pipes, and therefore (I think) OOB
ft_event() is used to quiet everything besides the pipes. This all seems
to work, but I was confused because the ft_event() functions
in oob/tcp and oob/ud do not seem to contain any functionality.

So do I try to fix the ft_event() function in oob/base/ to call the
registered ft_event() functions (which do nothing), or do I just remove
the call to orte_oob.ft_event()?

On Thu, Feb 06, 2014 at 10:49:25AM -0800, Ralph Castain wrote:
> The only reason I can think of for an OOB ft-event would be to tell the OOB 
> to stop sending any messages. You would need to push that into the event 
> library and use a callback event to let you know when it was done.
> 
> Of course, once you did that, the OOB would no longer be available to, for 
> example, tell the local daemon that the app is ready for checkpoint :-)
> 
> Afraid I'll have to defer to Josh H for any further guidance.
> 
> 
> On Feb 6, 2014, at 8:15 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > When I initially made the C/R code compile again I made following
> > change:
> > 
> > diff --git a/orte/mca/rml/oob/rml_oob_component.c 
> > b/orte/mca/rml/oob/rml_oob_component.c
> > index f0b22fc..90ed086 100644
> > --- a/orte/mca/rml/oob/rml_oob_component.c
> > +++ b/orte/mca/rml/oob/rml_oob_component.c
> > @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
> > ;
> > }
> > 
> > -if( ORTE_SUCCESS != 
> > -(ret = orte_oob.ft_event(state)) ) {
> > +if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
> > ORTE_ERROR_LOG(ret);
> > exit_status = ret;
> > goto cleanup;
> > 
> > 
> > 
> > This is, of course, wrong. Now the function calls itself in a loop until
> > it crashes. Looking at orte/mca/oob there is still a ft_event()
> > function, but it is disabled using "#if 0". Looking at other functions
> > it seems I would need to create something like
> > 
> > #define ORTE_OOB_FT_EVENT(m)
> > 
> > Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
> > in some places but it never seems to have any real functionality. Is
> > ft_event() actually needed there?
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
When I initially made the C/R code compile again I made the following
change:

diff --git a/orte/mca/rml/oob/rml_oob_component.c 
b/orte/mca/rml/oob/rml_oob_component.c
index f0b22fc..90ed086 100644
--- a/orte/mca/rml/oob/rml_oob_component.c
+++ b/orte/mca/rml/oob/rml_oob_component.c
@@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
 ;
 }

-if( ORTE_SUCCESS != 
-(ret = orte_oob.ft_event(state)) ) {
+if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
 ORTE_ERROR_LOG(ret);
 exit_status = ret;
 goto cleanup;



This is, of course, wrong. Now the function calls itself in a loop until
it crashes. Looking at orte/mca/oob there is still a ft_event()
function, but it is disabled using "#if 0". Looking at other functions
it seems I would need to create something like

#define ORTE_OOB_FT_EVENT(m)

Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
in some places but it never seems to have any real functionality. Is
ft_event() actually needed there?

Adrian


Re: [OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-04 Thread Adrian Reber
Thanks for spotting the 'printf'. I removed it as it was for debugging
at a very early stage. I committed the patch to SVN without the 'printf'.

Adrian

On Mon, Feb 03, 2014 at 12:42:39PM -0800, Ralph Castain wrote:
> Looks okay to me - I see you left a "printf" statement in 
> plm_base_launch_support.c, so you might want to make that an 
> opal_output_verbose or something.
> 
> On Feb 3, 2014, at 12:19 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > This patch
> > 
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=14ec7f42baab882e345948ff79c4f75f5084bbbf
> > 
> > introduces unique collective ids for the checkpoint/restart code and
> > with this applied it seems to work pretty good. As this patch also
> > touches non-CR code it would be good if someone could have a look at it.
> > 
> > With this patch applied the code seems to work up to the point where
> > orterun actually pauses all processes and tries to create the
> > checkpoints. The checkpoint creation does not work for me as CRS does
> > not yet include support for checkpoint/restart using CRIU which would be
> > my next step.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-03 Thread Adrian Reber
This patch

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=14ec7f42baab882e345948ff79c4f75f5084bbbf

introduces unique collective ids for the checkpoint/restart code, and
with this applied it seems to work pretty well. As this patch also
touches non-CR code, it would be good if someone could have a look at it.

With this patch applied the code seems to work up to the point where
orterun actually pauses all processes and tries to create the
checkpoints. The checkpoint creation does not work for me as CRS does
not yet include support for checkpoint/restart using CRIU which would be
my next step.

Adrian


Re: [OMPI devel] SNAPC: dynamic send buffers

2014-01-29 Thread Adrian Reber
Thanks for pointing out orte_rml_recv_callback(). It does just what I
need. I removed my own callback and I am now using orte_rml_recv_callback().

I have extended the patches to fix the usage of static buffers
in SNAPC and SSTORE, and to remove all remaining TODOs
from my 'getting-it-compiled-again' patches. The following
patches are ready to be committed:

2c69cdb SNAPC/CRCP/SSTORE: remove compiler warnings
 
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=2c69cdbf3ab9ebcb8c05540ed8807faa3db25203

e60592b SNAPC: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
 
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=e60592b629a8328538a2d752e0ec4b639a125465
 

17147ae SSTORE/CRCP: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
 
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=17147aeb4b9b9d20133be1807ee3369c788fe923
 

ea3891e SSTORE: use dynamic buffers for rml.send and rml.recv
 
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=ea3891ef9d095cfa40ade03fd676a1d61c932e5f
 

02c05d2 SNAPC: use dynamic buffers for rml.send and rml.recv
 
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=02c05d2685dc111919c63936acdaf4a594da0fa0
 


On Tue, Jan 28, 2014 at 08:01:53AM -0800, Ralph Castain wrote:
> This looks okay to me. Couple of comments:
> 
> 1. if you don't want to create your own callback function, you can use the 
> standard one. It does more than you need, but won't hurt anything:
> 
> ORTE_DECLSPEC void orte_rml_recv_callback(int status, orte_process_name_t* 
> sender,
>   opal_buffer_t *buffer,
>   orte_rml_tag_t tag, void *cbdata);
> 
> The code is in orte/mca/rml/base/rml_base_frame.c
> 
> 2. be aware that ORTE_WAIT_FOR_COMPLETION will block if you are in an RML 
> callback. I don't think that's an issue here, but just wanted to point it out.
> 
> Ralph
> 
> On Jan 27, 2014, at 8:12 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I have the following patches which I would like to commit. All changes
> > are in the SNAPC component. The first patch replaces all statically
> > allocated buffers with dynamically allocate buffers. The second patch
> > removes compiler warnings and the last patch tries to re-introduce
> > functionality which I removed with my 'getting-it-compiled-again'
> > patches. Instead of blocking recv() calls it now uses
> > ORTE_WAIT_FOR_COMPLETION(). I included gitweb links to the patches.
> > 
> > Please have a look at the patches.
> > 
> > Adrian
> > 
> > commit 6f10b44499b59c84d9032378c7f8c6b3526a029b
> > Author: Adrian Reber <adrian.re...@hs-esslingen.de>
> > Date:   Sun Jan 26 12:10:41 2014 +0100
> > 
> >SNAPC: use dynamic buffers for rml.send and rml.recv
> > 
> >The snapc component was still using static buffers
> >for send_buffer_nb(). This patch changes opal_buffer_t buffer;
> >to opal_buffer_t *buffer;
> > 
> > orte/mca/snapc/full/snapc_full_app.c| 119 
> > +++---
> > orte/mca/snapc/full/snapc_full_global.c |  73 
> > 
> > orte/mca/snapc/full/snapc_full_local.c  |  33 
> > +++--
> > 3 files changed, 114 insertions(+), 111 deletions(-)
> > 
> >  
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=6f10b44499b59c84d9032378c7f8c6b3526a029b
> > 
> > commit 218d04ad663ad76ad23cd99b62e83c435ccfe418
> > Author: Adrian Reber <adrian.re...@hs-esslingen.de>
> > Date:   Mon Jan 27 12:49:30 2014 +0100
> > 
> >SNAPC: remove compiler warnings
> > 
> > orte/mca/snapc/full/snapc_full_global.c | 19 +------
> > orte/mca/snapc/full/snapc_full_local.c  | 29 ++---
> > 2 files changed, 11 insertions(+), 37 deletions(-)
> > 
> >  
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=218d04ad663ad76ad23cd99b62e83c435ccfe418
> > 
> > commit 67d435cbe5df5c59519d605ce25443880244d2d5
> > Author: Adrian Reber <adrian.re...@hs-esslingen.de>
> > Date:   Mon Jan 27 14:31:36 2014 +0100
> > 
> >use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
> > 
> >During the commits to make the C/R code compile again the
> >blocking receive calls in snapc_full_app.c were
> >replaced by non-blocking receive calls with a dummy callback
> >function. This commit adds ORTE_WAIT_FOR_COMPLETION()
> >after each non-blocking receive to wait for the data.
> > 
> > orte/mca/snapc/full/snapc_full_app.c | 56 
> > +---
> > 1 file changed, 17 insertions(+), 39 deletions(-)
> > 
> >  
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=67d435cbe5df5c59519d605ce25443880244d2d5
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] SNAPC: dynamic send buffers

2014-01-27 Thread Adrian Reber
I have the following patches which I would like to commit. All changes
are in the SNAPC component. The first patch replaces all statically
allocated buffers with dynamically allocated buffers. The second patch
removes compiler warnings and the last patch tries to re-introduce
functionality which I removed with my 'getting-it-compiled-again'
patches. Instead of blocking recv() calls it now uses
ORTE_WAIT_FOR_COMPLETION(). I included gitweb links to the patches.

Please have a look at the patches.

Adrian

commit 6f10b44499b59c84d9032378c7f8c6b3526a029b
Author: Adrian Reber <adrian.re...@hs-esslingen.de>
Date:   Sun Jan 26 12:10:41 2014 +0100

SNAPC: use dynamic buffers for rml.send and rml.recv

The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;

 orte/mca/snapc/full/snapc_full_app.c| 119 
+++---
 orte/mca/snapc/full/snapc_full_global.c |  73 

 orte/mca/snapc/full/snapc_full_local.c  |  33 +++--
 3 files changed, 114 insertions(+), 111 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=6f10b44499b59c84d9032378c7f8c6b3526a029b

commit 218d04ad663ad76ad23cd99b62e83c435ccfe418
Author: Adrian Reber <adrian.re...@hs-esslingen.de>
Date:   Mon Jan 27 12:49:30 2014 +0100

SNAPC: remove compiler warnings

 orte/mca/snapc/full/snapc_full_global.c | 19 +--
 orte/mca/snapc/full/snapc_full_local.c  | 29 ++---
 2 files changed, 11 insertions(+), 37 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=218d04ad663ad76ad23cd99b62e83c435ccfe418

commit 67d435cbe5df5c59519d605ce25443880244d2d5
Author: Adrian Reber <adrian.re...@hs-esslingen.de>
Date:   Mon Jan 27 14:31:36 2014 +0100

use ORTE_WAIT_FOR_COMPLETION with non-blocking receives

During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls with a dummy callback
function. This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.

 orte/mca/snapc/full/snapc_full_app.c | 56 
+---
 1 file changed, 17 insertions(+), 39 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=67d435cbe5df5c59519d605ce25443880244d2d5
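
In practice the dynamic-buffer pattern the first patch switches to looks roughly like this (payload and tag are placeholders; the key point is that the buffer is handed off to the RML rather than released by the caller):

    opal_buffer_t *buffer = OBJ_NEW(opal_buffer_t);
    int32_t command = 1;   /* placeholder payload */

    opal_dss.pack(buffer, &command, 1, OPAL_INT32);

    /* Do not OBJ_RELEASE(buffer) after this call: the send is
     * asynchronous and the completion callback is expected to
     * release the buffer once the message is actually gone. */
    orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
                            ORTE_RML_TAG_CKPT,
                            orte_rml_send_callback, NULL);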


Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-24 Thread Adrian Reber
Status update of C/R with Open MPI:

With the last two patches applied I am now seeing communication
between orte-checkpoint and orterun:

orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid 
[INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer 
[[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer 
[[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at 
../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at 
../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 
13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--
Error: The application (PID = 23975) failed to checkpoint properly.
   Returned -1.
--

orterun:

[dcbz:23975] [[45520,0],0] Message posted at 
../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 
13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = 
[[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting 
request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command 
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command  +  
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] 
tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer 
[[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at 
../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

It's still not working, but at least both processes are
talking to each other, which is good.

Adrian


On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
> 
> 
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > > }
> > >
> > > if (ORTE_SUCCESS != (ret =
> > orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > -
> > ORTE_RML_TAG_CKPT, hnp_receiver,
> > > +
> > ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >NULL))) {
> > > exit_status = ret;
> > > goto cleanup;
> > > @@ -845,11 +845,6 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > > ORTE_JOBID_PRINT(jobid));
> > >
> > >  cleanup:
> > > -if( NULL != buffer) {
> > > -OBJ_RELEASE(buffer);
> > > -buffer = NULL;
> > > -}
> > > -
> > > if( ORTE_SUCCESS != exit_status ) {
> > > opal_show_help("help-orte-checkpoint.txt", "una

[OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Adrian Reber
Selecting SNAPC requires knowing whether the calling process is an app or not:

int orte_snapc_base_select(bool seed, bool app);

The following patch uses the correct define. Can I commit it like this?

diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
index dbbb2f4..f3a38f0 100644
--- a/orte/mca/ess/base/ess_base_std_app.c
+++ b/orte/mca/ess/base/ess_base_std_app.c
@@ -252,7 +252,7 @@ int orte_ess_base_app_setup(bool db_restrict_local)
 error = "orte_sstore_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
!ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;
diff --git a/orte/mca/ess/base/ess_base_std_tool.c 
b/orte/mca/ess/base/ess_base_std_tool.c
index 98c1685..7fcf83d 100644
--- a/orte/mca/ess/base/ess_base_std_tool.c
+++ b/orte/mca/ess/base/ess_base_std_tool.c
@@ -189,7 +189,7 @@ int orte_ess_base_tool_setup(void)
 error = "orte_snapc_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
!ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;
diff --git a/orte/mca/ess/hnp/ess_hnp_module.c 
b/orte/mca/ess/hnp/ess_hnp_module.c
index a6f1777..ea444c4 100644
--- a/orte/mca/ess/hnp/ess_hnp_module.c
+++ b/orte/mca/ess/hnp/ess_hnp_module.c
@@ -678,7 +678,7 @@ static int rte_init(void)
 error = "orte_sstore_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
!ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;


[OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Adrian Reber
The following patch makes orte-checkpoint communicate with orterun again:

diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c 
b/orte/tools/orte-checkpoint/orte-checkpoint.c
index 7106342..8539f34 100644
--- a/orte/tools/orte-checkpoint/orte-checkpoint.c
+++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
@@ -834,7 +834,7 @@ static int 
notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
 }

 if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), 
buffer,
-   ORTE_RML_TAG_CKPT, 
hnp_receiver,
+   ORTE_RML_TAG_CKPT, 
orte_rml_send_callback,
NULL))) {
 exit_status = ret;
 goto cleanup;
@@ -845,11 +845,6 @@ static int 
notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
 ORTE_JOBID_PRINT(jobid));

  cleanup:
-if( NULL != buffer) {
-OBJ_RELEASE(buffer);
-buffer = NULL;
-}
-
 if( ORTE_SUCCESS != exit_status ) {
 opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
orte_checkpoint_globals.pid);


Before committing the code into the repository I wanted to make
sure it is the correct way to fix it.

The first change switches the callback to orte_rml_send_callback().
When I initially made the code compile again I used hnp_receiver()
to convert the code from blocking to non-blocking, and that was
wrong.

The second change (removal of OBJ_RELEASE(buffer)) is necessary
because releasing the buffer there frees it while the send is still
in progress, and then everything breaks badly.

Adrian
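
Put differently, ownership of the buffer passes to the RML at send_buffer_nb() time, and the release belongs in the completion callback. A sketch of such a callback (this is essentially what the stock orte_rml_send_callback is expected to do):

    static void my_send_complete(int status, orte_process_name_t *peer,
                                 opal_buffer_t *buffer, orte_rml_tag_t tag,
                                 void *cbdata)
    {
        /* the send has finished (or failed); only now is it safe
         * to give the buffer back */
        OBJ_RELEASE(buffer);
    }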


Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Thanks, that helps. Now it actually starts to communicate with the
orterun process. This still fails but I will try to fix it.

On Tue, Jan 21, 2014 at 12:27:55PM -0800, Ralph Castain wrote:
> That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no 
> !). The problem is that orte-checkpoint is a tool, and so it isn't a daemon - 
> but it is also not an app.
> 
> 
> On Jan 21, 2014, at 11:56 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Good to know that it does not make any sense. So it's not just me.
> > 
> > Looking at the call chain I can see
> > 
> > orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);
> > 
> > and the second parameter is used to decide if it is an app or not:
> > 
> > int orte_snapc_base_select(bool seed, bool app) in 
> > orte/mca/snapc/base/snapc_base_select.c
> > 
> > and if it is true the code with the barrier is used.
> > 
> > In orte/mca/snapc/base/snapc_base_select.c there is also following
> > comment:
> > 
> > /* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once 
> > the framework system is in place */
> > 
> > Is this something which needs to be changed and which might be the cause
> > for this problem?
> > 
> > 
> > On Tue, Jan 21, 2014 at 07:27:32AM -0800, Ralph Castain wrote:
> >> That doesn't make any sense - I can't imagine a reason for orte-checkpoint 
> >> itself to be running a barrier. I wonder if it is selecting the wrong 
> >> component in snapc?
> >> 
> >> As for the patch, that isn't going to work. The collective id has to be 
> >> *globally* unique, which means that only orterun can issue a new one. So 
> >> you have to get thru orte_init before you can request one as it requires a 
> >> communication.
> >> 
> >> However, like I said, it makes no sense for orte-checkpoint to do a 
> >> barrier as it is a singleton - there is nothing for it to "barrier" with.
> >> 
> >> On Jan 21, 2014, at 7:24 AM, Adrian Reber <adr...@lisas.de> wrote:
> >> 
> >>> I think I still do not really understand how it works.
> >>> 
> >>> The barrier on which orte-checkpoint is currently hanging is in
> >>> app_coord_init(). You are also saying that orte-checkpoint
> >>> should not be calling a barrier. The backtrace of the point where it
> >>> is hanging now looks like:
> >>> 
> >>> #0  0x769befa0 in __nanosleep_nocancel () at 
> >>> ../sysdeps/unix/syscall-template.S:81
> >>> #1  0x77b45712 in app_coord_init () at 
> >>> ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
> >>> #2  0x77b3a5ce in orte_snapc_full_module_init (seed=false, 
> >>> app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >>> #3  0x77b375de in orte_snapc_base_select (seed=false, app=true) 
> >>> at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
> >>> #4  0x77a9884a in orte_ess_base_tool_setup () at 
> >>> ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
> >>> #5  0x77a9fe85 in rte_init () at 
> >>> ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
> >>> #6  0x77a4647f in orte_init (pargc=0x7fffd94c, 
> >>> pargv=0x7fffd940, flags=8) at ../../orte/runtime/orte_init.c:158
> >>> #7  0x00402859 in ckpt_init (argc=51, argv=0x7fffda78) at 
> >>> ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
> >>> #8  0x00401d7a in main (argc=51, argv=0x7fffda78) at 
> >>> ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
> >>> 
> >>> Maybe I am doing something completely wrong. I am currently
> >>> running 'orterun -np 2 test-programm'.
> >>> 
> >>> In another terminal I am starting orte-checkpoint with the PID of
> >>> orterun and the barrier in app_coord_init() is just before it tries
> >>> to communicate with orterun. Is this the correct setup?
> >>> 
> >>>   Adrian
> >>> 
> >>> On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
> >>>> If it is the application, then there is probably a barrier in the
> >>>> app_coord_init() to make sure all the applications are up and running.
> >>>> After this point then the global coordinator knows that the application 
> >>>> can
> >>>> be checkpointed.
> >

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Good to know that it does not make any sense. So it's not just me.

Looking at the call chain I can see

orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);

and the second parameter is used to decide if it is an app or not:

int orte_snapc_base_select(bool seed, bool app) in 
orte/mca/snapc/base/snapc_base_select.c

and if it is true the code with the barrier is used.

In orte/mca/snapc/base/snapc_base_select.c there is also following
comment:

/* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once the 
framework system is in place */

Is this something which needs to be changed and which might be the cause
for this problem?


On Tue, Jan 21, 2014 at 07:27:32AM -0800, Ralph Castain wrote:
> That doesn't make any sense - I can't imagine a reason for orte-checkpoint 
> itself to be running a barrier. I wonder if it is selecting the wrong 
> component in snapc?
> 
> As for the patch, that isn't going to work. The collective id has to be 
> *globally* unique, which means that only orterun can issue a new one. So you 
> have to get thru orte_init before you can request one as it requires a 
> communication.
> 
> However, like I said, it makes no sense for orte-checkpoint to do a barrier 
> as it is a singleton - there is nothing for it to "barrier" with.
> 
> On Jan 21, 2014, at 7:24 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I think I still do not really understand how it works.
> > 
> > The barrier on which orte-checkpoint is currently hanging is in
> > app_coord_init(). You are also saying that orte-checkpoint
> > should not be calling a barrier. The backtrace of the point where it
> > is hanging now looks like:
> > 
> > #0  0x769befa0 in __nanosleep_nocancel () at 
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x77b45712 in app_coord_init () at 
> > ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
> > #2  0x77b3a5ce in orte_snapc_full_module_init (seed=false, 
> > app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> > #3  0x77b375de in orte_snapc_base_select (seed=false, app=true) at 
> > ../../../../orte/mca/snapc/base/snapc_base_select.c:96
> > #4  0x77a9884a in orte_ess_base_tool_setup () at 
> > ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
> > #5  0x77a9fe85 in rte_init () at 
> > ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
> > #6  0x77a4647f in orte_init (pargc=0x7fffd94c, 
> > pargv=0x7fffd940, flags=8) at ../../orte/runtime/orte_init.c:158
> > #7  0x00402859 in ckpt_init (argc=51, argv=0x7fffda78) at 
> > ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
> > #8  0x00401d7a in main (argc=51, argv=0x7fffda78) at 
> > ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
> > 
> > Maybe I am doing something completely wrong. I am currently
> > running 'orterun -np 2 test-programm'.
> > 
> > In another terminal I am starting orte-checkpoint with the PID of
> > orterun and the barrier in app_coord_init() is just before it tries
> > to communicate with orterun. Is this the correct setup?
> > 
> > Adrian
> > 
> > On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
> >> If it is the application, then there is probably a barrier in the
> >> app_coord_init() to make sure all the applications are up and running.
> >> After this point then the global coordinator knows that the application can
> >> be checkpointed.
> >> 
> >> I don't think orte-checkpoint should be calling a barrier - from what I
> >> recall.
> >> 
> >> 
> >> On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> 
> >>> Is it orte-checkpoint that is hanging, or the app you are trying to
> >>> checkpoint?
> >>> 
> >>> 
> >>> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>> Thanks for your help. I tried initializing the barrier correctly (see
> >>> attached patch) but now, instead of crashing, it just hangs on the
> >>> barrier while running orte-checkpoint
> >>> 
> >>> [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> >>> [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at
> >>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> >>> 
> >>> #0  0x769befa0 in __nanosleep_nocancel () at
> >>> ../sysdeps/unix/syscall-template.S:81
> >>> #1  0x77b456ba in app_coord_init () 

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
It is orte-checkpoint that hangs, before communicating with orterun, which runs the
processes I am trying to checkpoint. The full backtrace:

#0  0x769befa0 in __nanosleep_nocancel () at 
../sysdeps/unix/syscall-template.S:81
#1  0x77b45712 in app_coord_init () at 
../../../../../orte/mca/snapc/full/snapc_full_app.c:208
#2  0x77b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at 
../../../../../orte/mca/snapc/full/snapc_full_module.c:207
#3  0x77b375de in orte_snapc_base_select (seed=false, app=true) at 
../../../../orte/mca/snapc/base/snapc_base_select.c:96
#4  0x77a9884a in orte_ess_base_tool_setup () at 
../../../../orte/mca/ess/base/ess_base_std_tool.c:192
#5  0x77a9fe85 in rte_init () at 
../../../../../orte/mca/ess/tool/ess_tool_module.c:83
#6  0x77a4647f in orte_init (pargc=0x7fffd94c, 
pargv=0x7fffd940, flags=8) at ../../orte/runtime/orte_init.c:158
#7  0x00402859 in ckpt_init (argc=51, argv=0x7fffda78) at 
../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
#8  0x00401d7a in main (argc=51, argv=0x7fffda78) at 
../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245


On Mon, Jan 20, 2014 at 02:46:04PM -0800, Ralph Castain wrote:
> Is it orte-checkpoint that is hanging, or the app you are trying to 
> checkpoint?
> 
> 
> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Thanks for your help. I tried initializing the barrier correctly (see
> > attached patch) but now, instead of crashing, it just hangs on the
> > barrier while running orte-checkpoint
> > 
> > [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> > [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at 
> > ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> > 
> > #0  0x769befa0 in __nanosleep_nocancel () at 
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x77b456ba in app_coord_init () at 
> > ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> > #2  0x77b3a582 in orte_snapc_full_module_init (seed=false, 
> > app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> > 
> > it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
> > 
> > I do not understand on what the barrier here is actually waiting for. Where
> > do I need to look to find the place the barrier is waiting for?
> > 
> > I also tried initializing the collective id's in
> > orte/mca/plm/base/plm_base_launch_support.c but that code is never
> > used running the orte-checkpoint tool
> > 
> > Adrian
> > 
> > On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> >> I took a look at this, and I'm afraid you have some work to do in the 
> >> orte/mca/snapc code base:
> >> 
> >> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See 
> >> r30261 for an example of the changes that need to be made - I did some, 
> >> but can't swear to catching them all. It was enough to at least get a proc 
> >> past the initial snapc registration
> >> 
> >> 2. you are reusing collective id's to execute several orte_grpcomm.barrier 
> >> calls - those ids are used elsewhere during MPI_Init. This is not allowed 
> >> - a collective id can only be used *once*. What you need to do is go into 
> >> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) 
> >> add cr-specific collective id's for this purpose. I don't know how many 
> >> places in the cr code create their own barriers, but they each need a 
> >> collective id.
> >> 
> >> If you prefer and have the time, you are welcome to extend the collective 
> >> code to allow id reuse. This would require that each daemon and app 
> >> "reset" the collective fields when a collective is declared complete. It 
> >> isn't that hard to do - just never had a reason to do it. I can take a 
> >> shot at it when time permits (may have some time this weekend)
> >> 
> >> 3. when you post the non-blocking recv in the snapc/full code, it looks to 
> >> me like you need to block until you get the answer. I don't know where in 
> >> the code flow this is occurring - if you are not in an event, then it is 
> >> okay to block using ORTE_WAIT_FOR_COMPLETION. Look in 
> >> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
> >> 
> >> HTH
> >> Ralph
> >> 
> >> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> 
> >>

Re: [OMPI devel] callback debugging

2014-01-20 Thread Adrian Reber
Thanks for your help. I tried initializing the barrier correctly (see
attached patch) but now, instead of crashing, it just hangs on the
barrier while running orte-checkpoint:

[dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
[dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at 
../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206

#0  0x769befa0 in __nanosleep_nocancel () at 
../sysdeps/unix/syscall-template.S:81
#1  0x77b456ba in app_coord_init () at 
../../../../../orte/mca/snapc/full/snapc_full_app.c:207
#2  0x77b3a582 in orte_snapc_full_module_init (seed=false, app=true) at 
../../../../../orte/mca/snapc/full/snapc_full_module.c:207

it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);

I do not understand what the barrier here is actually waiting on. Where
do I need to look to find out what the barrier is waiting for?

I also tried initializing the collective id's in
orte/mca/plm/base/plm_base_launch_support.c, but that code is never
executed when running the orte-checkpoint tool.

Adrian

On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> I took a look at this, and I'm afraid you have some work to do in the 
> orte/mca/snapc code base:
> 
> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See 
> r30261 for an example of the changes that need to be made - I did some, but 
> can't swear to catching them all. It was enough to at least get a proc past 
> the initial snapc registration
> 
> 2. you are reusing collective id's to execute several orte_grpcomm.barrier 
> calls - those ids are used elsewhere during MPI_Init. This is not allowed - a 
> collective id can only be used *once*. What you need to do is go into 
> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add 
> cr-specific collective id's for this purpose. I don't know how many places in 
> the cr code create their own barriers, but they each need a collective id.
> 
> If you prefer and have the time, you are welcome to extend the collective 
> code to allow id reuse. This would require that each daemon and app "reset" 
> the collective fields when a collective is declared complete. It isn't that 
> hard to do - just never had a reason to do it. I can take a shot at it when 
> time permits (may have some time this weekend)
> 
> 3. when you post the non-blocking recv in the snapc/full code, it looks to me 
> like you need to block until you get the answer. I don't know where in the 
> code flow this is occurring - if you are not in an event, then it is okay to 
> block using ORTE_WAIT_FOR_COMPLETION. Look in 
> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
> 
> HTH
> Ralph
> 
> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > 
> > On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> >> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>> 
> >>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>>> I am currently trying to understand how callbacks are working. Right now
> >>>> I am looking at orte/mca/rml/base/rml_base_receive.c
> >>>> orte_rml_base_comm_start() which does 
> >>>> 
> >>>>   orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>>   ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>>   ORTE_RML_PERSISTENT,
> >>>>   orte_rml_base_recv,
> >>>>   NULL);
> >>>> 
> >>>> As far as I understand it orte_rml_base_recv() is the callback function.
> >>>> At which point should this function run? When the data is actually
> >>>> received?
> >>> 
> >>> Not precisely. When data is received by the OOB, it pushes the data into 
> >>> an event. When that event gets serviced, it calls the 
> >>> orte_rml_base_receive function which processes the data to find the 
> >>> matching tag, and then uses that to execute the callback to the user code.
> >>> 
> >>>> 
> >>>> The same for send_buffer_nb() functions. I do not see the callback
> >>>> functions actually running. How can I verify that the callback functions
> >>>> are running. Especially for the send case it sounds pretty obvious how
> >>>> it should work but I never see the callback function running. At least
> >>>> in my setup.
> >>> 
> >>> The data is not immediately sent. It gets pushed into an event. When that 
> >&

Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> 
> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I am currently trying to understand how callbacks are working. Right now
> > I am looking at orte/mca/rml/base/rml_base_receive.c
> > orte_rml_base_comm_start() which does 
> > 
> >orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >ORTE_RML_TAG_RML_INFO_UPDATE,
> >ORTE_RML_PERSISTENT,
> >orte_rml_base_recv,
> >NULL);
> > 
> > As far as I understand it orte_rml_base_recv() is the callback function.
> > At which point should this function run? When the data is actually
> > received?
> 
> Not precisely. When data is received by the OOB, it pushes the data into an 
> event. When that event gets serviced, it calls the orte_rml_base_receive 
> function which processes the data to find the matching tag, and then uses 
> that to execute the callback to the user code.
> 
> > 
> > The same for send_buffer_nb() functions. I do not see the callback
> > functions actually running. How can I verify that the callback functions
> > are running. Especially for the send case it sounds pretty obvious how
> > it should work but I never see the callback function running. At least
> > in my setup.
> 
> The data is not immediately sent. It gets pushed into an event. When that 
> event gets serviced, it calls the orte_oob_base_send function which then 
> passes the data to each active OOB component until one of them says it can 
> send it. The data is then pushed into another event to get it into the event 
> base for that component's active module - when that event gets serviced, the 
> data is sent. Once the data is sent, an event is created that, when serviced, 
> executes the callback to the user code.
> 
> If you aren't seeing callbacks, the most likely cause is that the orte 
> progress thread isn't running. Without it, none of this will work.

Thanks. Running configure without '--with-ft=cr' I can run a program and
use orte-top. In orterun I can see that the callback is running and
orte-top displays the retrieved information. I can also see in orte-top
that the callbacks are working. Doing the same with '--with-ft=cr'
enabled, orte-top crashes, as does orte-checkpoint; both (-top and
-checkpoint) seem to no longer have working callbacks, which is probably
why they are crashing. So some code which is enabled by '--with-ft=cr'
seems to break callbacks in orte-top as well as in orte-checkpoint.
orterun handles callbacks no matter if configured with or without
'--with-ft=cr'.

Adrian
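
To make the flow above concrete, the non-blocking RML pattern boils down to
the following (a minimal sketch assembled from the calls quoted in this
thread; MY_TAG, the peer argument and the callback bodies are illustrative,
not code from the tree):

    #include "opal/dss/dss.h"
    #include "orte/mca/rml/rml.h"

    #define MY_TAG  4711   /* placeholder: stands in for a real ORTE_RML_TAG_* value */

    /* fires from the ORTE progress thread once the matching message has been
     * pulled out of the event base */
    static void my_recv_cb(int status, orte_process_name_t *sender,
                           opal_buffer_t *buffer, orte_rml_tag_t tag, void *cbdata)
    {
        /* unpack 'buffer' here; do not block inside the callback */
    }

    /* the send callback is the usual place to release the buffer */
    static void my_send_cb(int status, orte_process_name_t *peer,
                           opal_buffer_t *buffer, orte_rml_tag_t tag, void *cbdata)
    {
        OBJ_RELEASE(buffer);
    }

    static void post_messaging(orte_process_name_t *peer, int value)
    {
        /* persistent receive: the callback can fire many times */
        orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, MY_TAG,
                                ORTE_RML_PERSISTENT, my_recv_cb, NULL);

        /* the buffer must be dynamically allocated and stay alive until
         * my_send_cb runs; nothing goes on the wire until the event is serviced */
        opal_buffer_t *buf = OBJ_NEW(opal_buffer_t);
        opal_dss.pack(buf, &value, 1, OPAL_INT);
        orte_rml.send_buffer_nb(peer, buf, MY_TAG, my_send_cb, NULL);
    }

If the ORTE progress thread is not running, none of these callbacks will ever
fire, which matches the behaviour described above.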


Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
For my CR work this can probably be ignored. I think I was looking in the
wrong place.

On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
> Continuing with the CR code I now get a crash which can be easily reproduced
> using orte/test/system/orte_barrier.c
> 
> I get:
> 
> orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: 
> Assertion `0 == item->opal_list_item_refcount' failed.
> [dcbz:05085] *** Process received signal ***
> [dcbz:05085] Signal: Aborted (6)
> [dcbz:05085] Signal code:  (-6)
> [dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
> [dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
> [dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
> [dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
> [dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
> [dcbz:05085] [ 5] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
> [dcbz:05085] [ 6] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
> [dcbz:05085] [ 7] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
> [dcbz:05085] [ 8] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
> [dcbz:05085] [ 9] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
> [dcbz:05085] [10] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
> [dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
> [dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
> [dcbz:05085] *** End of error message ***
> --
> orterun noticed that process rank 0 with PID 5085 on node dcbz exited on 
> signal 6 (Aborted).
> --
> 
> and in gdb
> 
> [New LWP 5086]
> [New LWP 5085]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `system/orte_barrier'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f95bc6744a8 in __GI_abort () at abort.c:118
> #2  0x7f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: 
> %s%sAssertion `%s' failed.\n%n", 
> assertion=assertion@entry=0x7f95bd06d6c0 "0 == 
> item->opal_list_item_refcount", 
> file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", 
> line=line@entry=547, 
> function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> 
> "_opal_list_append") at assert.c:92
> #3  0x7f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == 
> item->opal_list_item_refcount", 
> file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547, 
> function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") 
> at assert.c:101
> #4  0x7f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 
> <orte_grpcomm_base+8>, item=0x1f35be0, 
> FILE_NAME=0x7f95bd06d718 
> "../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
> at ../../../../../opal/class/opal_list.h:547
> #5  0x7f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0) 
> at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
> #6  0x7f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, 
> activeq=0x1ef6360)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #7  0x7f95bcca5aa3 in event_process_active (base=0x1ef63a0) at 
> ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #8  0x7f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, 
> flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #9  0x7f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 
> ) at ../../orte/runtime/orte_init.c:180
> #10 0x7f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at 
> pthread_create.c:309
> #11 0x7f95bc731ead in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) 
> 
> As far as I understand it seems to call opal_list_append() twice in
> orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
> 
> opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);
> 
> I have no idea how to fix this.
> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
Continuing with the CR code I now get a crash which can be easily reproduced
using orte/test/system/orte_barrier.c

I get:

orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: 
Assertion `0 == item->opal_list_item_refcount' failed.
[dcbz:05085] *** Process received signal ***
[dcbz:05085] Signal: Aborted (6)
[dcbz:05085] Signal code:  (-6)
[dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
[dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
[dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
[dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
[dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
[dcbz:05085] [ 5] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
[dcbz:05085] [ 6] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
[dcbz:05085] [ 7] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
[dcbz:05085] [ 8] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
[dcbz:05085] [ 9] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
[dcbz:05085] [10] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
[dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
[dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
[dcbz:05085] *** End of error message ***
--
orterun noticed that process rank 0 with PID 5085 on node dcbz exited on signal 
6 (Aborted).
--

and in gdb

[New LWP 5086]
[New LWP 5085]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `system/orte_barrier'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f95bc6744a8 in __GI_abort () at abort.c:118
#2  0x7f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: 
%s%sAssertion `%s' failed.\n%n", 
assertion=assertion@entry=0x7f95bd06d6c0 "0 == 
item->opal_list_item_refcount", 
file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", 
line=line@entry=547, 
function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> 
"_opal_list_append") at assert.c:92
#3  0x7f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == 
item->opal_list_item_refcount", 
file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547, 
function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") at 
assert.c:101
#4  0x7f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 
<orte_grpcomm_base+8>, item=0x1f35be0, 
FILE_NAME=0x7f95bd06d718 
"../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
at ../../../../../opal/class/opal_list.h:547
#5  0x7f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0) at 
../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
#6  0x7f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, 
activeq=0x1ef6360)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#7  0x7f95bcca5aa3 in event_process_active (base=0x1ef63a0) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#8  0x7f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#9  0x7f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 
) at ../../orte/runtime/orte_init.c:180
#10 0x7f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at 
pthread_create.c:309
#11 0x7f95bc731ead in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

As far as I understand it seems to call opal_list_append() twice in
orte/mca/grpcomm/bad/grpcomm_bad_module.c:163

opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);

I have no idea how to fix this.

Adrian
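
For reference, the assertion is the debug-build check that an item is not
already linked into another list; stripped of the grpcomm context it amounts
to this (a minimal illustration of the failure mode, not the eventual fix):

    #include "opal/class/opal_list.h"

    static void refcount_demo(void)
    {
        opal_list_t list_a, list_b;
        opal_list_item_t *item = OBJ_NEW(opal_list_item_t);

        OBJ_CONSTRUCT(&list_a, opal_list_t);
        OBJ_CONSTRUCT(&list_b, opal_list_t);

        opal_list_append(&list_a, item);  /* ok: refcount 0 -> 1 in debug builds          */
        opal_list_append(&list_b, item);  /* asserts: 0 == item->opal_list_item_refcount  */

        /* the item has to come off the first list before it can be appended again */
        opal_list_remove_item(&list_a, item);
        opal_list_append(&list_b, item);  /* fine */
    }

Which appears to match the symptom above: process_barrier() appending a
collective that is already sitting on orte_grpcomm_base.active_colls.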


Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-07 Thread Adrian Reber
I have committed fixes for

opal/mca/compress/base/compress_base_open.c and
opal/mca/crs/base/crs_base_open.c

which return OPAL_SUCCESS but do not open the components if CR is
disabled.

Adrian
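
The committed shape of the open function is roughly the following (a sketch of
the behaviour described above, not the literal commit):

    int opal_compress_base_open(mca_base_open_flag_t flags)
    {
        /* compression is currently only used with C/R: report success but
         * leave the components closed when FT is disabled at runtime */
        if (!opal_cr_is_enabled) {
            return OPAL_SUCCESS;
        }

        /* otherwise open up all available components as usual */
        return mca_base_framework_components_open(&opal_compress_base_framework, flags);
    }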

On Tue, Jan 07, 2014 at 01:43:00PM -0600, Josh Hursey wrote:
> Either would be fine with me. If you left in the verbose message then it
> might be a bit confusing to any user that might see it.
> 
> 
> On Fri, Jan 3, 2014 at 9:13 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > That would work. Alternatively, you could try just returning OPAL_SUCCESS
> > in place of OPAL_ERR_NOT_AVAILABLE. This would still avoid opening the
> > components for no reason (thus saving some memory) while not causing
> > opal_init to abort.
> >
> >
> > On Jan 3, 2014, at 3:19 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > So removing all output like in this patch would be ok?
> > >
> > > diff --git a/opal/mca/compress/base/compress_base_open.c
> > b/opal/mca/compress/base/compress_base_open.c
> > > index a09fe59..f487752 100644
> > > --- a/opal/mca/compress/base/compress_base_open.c
> > > +++ b/opal/mca/compress/base/compress_base_open.c
> > > @@ -14,13 +14,9 @@
> > >
> > > #include "opal_config.h"
> > >
> > > -#include 
> > > -#include "opal/mca/mca.h"
> > > #include "opal/mca/base/base.h"
> > > #include "opal/include/opal/constants.h"
> > > -#include "opal/mca/compress/compress.h"
> > > #include "opal/mca/compress/base/base.h"
> > > -#include "opal/util/output.h"
> > >
> > > #include "opal/mca/compress/base/static-components.h"
> > >
> > > @@ -45,13 +41,6 @@ MCA_BASE_FRAMEWORK_DECLARE(opal, compress, NULL,
> > opal_compress_base_register, op
> > >
> > > static int opal_compress_base_register (mca_base_register_flag_t flags)
> > > {
> > > -/* Compression currently only used with C/R */
> > > -if( !opal_cr_is_enabled ) {
> > > -opal_output_verbose(10,
> > opal_compress_base_framework.framework_output,
> > > -"compress:open: FT is not enabled,
> > skipping!");
> > > -return OPAL_ERR_NOT_AVAILABLE;
> > > -}
> > > -
> > > return OPAL_SUCCESS;
> > > }
> > >
> > > @@ -61,13 +50,6 @@ static int opal_compress_base_register
> > (mca_base_register_flag_t flags)
> > >  */
> > > int opal_compress_base_open(mca_base_open_flag_t flags)
> > > {
> > > -/* Compression currently only used with C/R */
> > > -if( !opal_cr_is_enabled ) {
> > > -opal_output_verbose(10,
> > opal_compress_base_framework.framework_output,
> > > -"compress:open: FT is not enabled,
> > skipping!");
> > > -return OPAL_SUCCESS;
> > > -}
> > > -
> > > /* Open up all available components */
> > > return mca_base_framework_components_open
> > (&opal_compress_base_framework, flags);
> > > }
> > >
> > >
> > >
> > > On Thu, Jan 02, 2014 at 12:32:32PM -0500, Josh Hursey wrote:
> > >> I think the only reason I protected that framework is to reduce the
> > >> overhead of an application using a build of Open MPI with CR support,
> > but
> > >> no enabling it at runtime. Nothing in the compress framework depends on
> > the
> > >> CR infrastructure (although the CR infrastructure can use the compress
> > >> framework if the user chooses to). So I bet we can remove the protection
> > >> altogether and be fine.
> > >>
> > >> So I think this patch is fine. I might also go as far as removing the
> > 'if'
> > >> block altogether as the protection should not been needed any longer.
> > >>
> > >> Thanks,
> > >> Josh
> > >>
> > >>
> > >>
> > >> On Fri, Dec 27, 2013 at 3:46 PM, Adrian Reber <adr...@lisas.de> wrote:
> > >>
> > >>> Right now the C/R code fails because of a change introduced in
> > >>> opal/mca/compress/base/compress_base_open.c in 2013 with commit
> > >>>
> > >>> git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b
> > >>> svn r28239
> > >>>
> > >>>Update OPAL frameworks to use the MCA framework system.
> > >>>
> > &g

Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-03 Thread Adrian Reber
So removing all output like in this patch would be ok?

diff --git a/opal/mca/compress/base/compress_base_open.c 
b/opal/mca/compress/base/compress_base_open.c
index a09fe59..f487752 100644
--- a/opal/mca/compress/base/compress_base_open.c
+++ b/opal/mca/compress/base/compress_base_open.c
@@ -14,13 +14,9 @@

 #include "opal_config.h"

-#include 
-#include "opal/mca/mca.h"
 #include "opal/mca/base/base.h"
 #include "opal/include/opal/constants.h"
-#include "opal/mca/compress/compress.h"
 #include "opal/mca/compress/base/base.h"
-#include "opal/util/output.h"

 #include "opal/mca/compress/base/static-components.h"

@@ -45,13 +41,6 @@ MCA_BASE_FRAMEWORK_DECLARE(opal, compress, NULL, 
opal_compress_base_register, op

 static int opal_compress_base_register (mca_base_register_flag_t flags)
 {
-/* Compression currently only used with C/R */
-if( !opal_cr_is_enabled ) {
-opal_output_verbose(10, opal_compress_base_framework.framework_output,
-"compress:open: FT is not enabled, skipping!");
-return OPAL_ERR_NOT_AVAILABLE;
-}
-
 return OPAL_SUCCESS;
 }

@@ -61,13 +50,6 @@ static int opal_compress_base_register 
(mca_base_register_flag_t flags)
  */
 int opal_compress_base_open(mca_base_open_flag_t flags)
 {
-/* Compression currently only used with C/R */
-if( !opal_cr_is_enabled ) {
-opal_output_verbose(10, opal_compress_base_framework.framework_output,
-"compress:open: FT is not enabled, skipping!");
-return OPAL_SUCCESS;
-}
-
 /* Open up all available components */
 return mca_base_framework_components_open (&opal_compress_base_framework, 
flags);
 }



On Thu, Jan 02, 2014 at 12:32:32PM -0500, Josh Hursey wrote:
> I think the only reason I protected that framework is to reduce the
> overhead of an application using a build of Open MPI with CR support, but
> no enabling it at runtime. Nothing in the compress framework depends on the
> CR infrastructure (although the CR infrastructure can use the compress
> framework if the user chooses to). So I bet we can remove the protection
> altogether and be fine.
> 
> So I think this patch is fine. I might also go as far as removing the 'if'
> block altogether as the protection should not been needed any longer.
> 
> Thanks,
> Josh
> 
> 
> 
> On Fri, Dec 27, 2013 at 3:46 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Right now the C/R code fails because of a change introduced in
> > opal/mca/compress/base/compress_base_open.c in 2013 with commit
> >
> > git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b
> > svn r28239
> >
> > Update OPAL frameworks to use the MCA framework system.
> >
> > This commit changed a lot but also the return value of the function from
> > OPAL_SUCCESS to OPAL_ERR_NOT_AVAILABLE. With following patch I can make
> > it work again:
> >
> > diff --git a/opal/mca/compress/base/compress_base_open.c
> > b/opal/mca/compress/base/compress_base_open.c
> > index a09fe59..69eabfa 100644
> > --- a/opal/mca/compress/base/compress_base_open.c
> > +++ b/opal/mca/compress/base/compress_base_open.c
> > @@ -45,11 +45,11 @@ MCA_BASE_FRAMEWORK_DECLARE(opal, compress, NULL,
> > opal_compress_base_register, op
> >
> >  static int opal_compress_base_register (mca_base_register_flag_t flags)
> >  {
> >  /* Compression currently only used with C/R */
> >  if( !opal_cr_is_enabled ) {
> >  opal_output_verbose(10,
> > opal_compress_base_framework.framework_output,
> >  "compress:open: FT is not enabled,
> > skipping!");
> > -return OPAL_ERR_NOT_AVAILABLE;
> > +return OPAL_SUCCESS;
> >  }
> >
> >  return OPAL_SUCCESS;
> >
> >
> > My question is if OPAL_ERR_NOT_AVAILABLE is indeed the correct return value
> > and the function calling opal_compress_base_register() should actually
> > handle OPAL_ERR_NOT_AVAILABLE or if returning OPAL_SUCCESS is still the
> > right
> > return value?
> >
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2013-12-27 Thread Adrian Reber
Right now the C/R code fails because of a change introduced in
opal/mca/compress/base/compress_base_open.c in 2013 with commit

git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b
svn r28239

Update OPAL frameworks to use the MCA framework system.

This commit changed a lot but also the return value of the function from
OPAL_SUCCESS to OPAL_ERR_NOT_AVAILABLE. With following patch I can make
it work again:

diff --git a/opal/mca/compress/base/compress_base_open.c 
b/opal/mca/compress/base/compress_base_open.c
index a09fe59..69eabfa 100644
--- a/opal/mca/compress/base/compress_base_open.c
+++ b/opal/mca/compress/base/compress_base_open.c
@@ -45,11 +45,11 @@ MCA_BASE_FRAMEWORK_DECLARE(opal, compress, NULL, 
opal_compress_base_register, op

 static int opal_compress_base_register (mca_base_register_flag_t flags)
 {
 /* Compression currently only used with C/R */
 if( !opal_cr_is_enabled ) {
 opal_output_verbose(10, opal_compress_base_framework.framework_output,
 "compress:open: FT is not enabled, skipping!");
-return OPAL_ERR_NOT_AVAILABLE;
+return OPAL_SUCCESS;
 }

 return OPAL_SUCCESS;


My question is whether OPAL_ERR_NOT_AVAILABLE is indeed the correct return
value (in which case the caller of opal_compress_base_register() should
actually handle OPAL_ERR_NOT_AVAILABLE), or whether returning OPAL_SUCCESS is
still the right return value.

Adrian


Re: [OMPI devel] C/R code: opal_list_item_destruct: Assertion

2013-12-22 Thread Adrian Reber
That works. Thanks for your fix.

On Sun, Dec 22, 2013 at 12:23:44AM +0100, George Bosilca wrote:
> Adrian,
> 
> Yes, your patch is correct. However, I noticed that each framework clean it’s 
> modules differently, so I tried to enforce some level of consistency. Please 
> try r30045 and let me know if it fixes your issue.
> 
> George.
> 
> 
> On Dec 21, 2013, at 22:05 , Adrian Reber <adr...@lisas.de> wrote:
> 
> > Trying to run Open MPI with C/R enabled I get the following error
> > with --enable-debug:
> > 
> > [dcbz:20360] orte_rml_base_select: initializing rml component oob
> > [dcbz:20360] orte_rml_base_select: initializing rml component ftrm
> > [dcbz:20360] orte_rml_base_select: module ftrm unloaded
> > orterun: ../../opal/class/opal_list.c:69: opal_list_item_destruct: 
> > Assertion `0 == item->opal_list_item_refcount' failed.
> > [dcbz:20360] *** Process received signal ***
> > [dcbz:20360] Signal: Aborted (6)
> > [dcbz:20360] Signal code:  (-6)
> > 
> > I fixed it like this:
> > 
> > diff --git a/orte/mca/rml/base/rml_base_frame.c 
> > b/orte/mca/rml/base/rml_base_frame.c
> > index 8759180..968884f 100644
> > --- a/orte/mca/rml/base/rml_base_frame.c
> > +++ b/orte/mca/rml/base/rml_base_frame.c
> > @@ -181,6 +181,7 @@ int orte_rml_base_select(void)
> > component->rml_version.mca_component_name);
> > 
> > mca_base_component_repository_release((mca_base_component_t *) 
> > component);
> > +opal_list_remove_item(&orte_rml_base_framework.framework_components, item);
> > OBJ_RELEASE(item);
> > }
> > item = next;
> > 
> > 
> > Is this the correct way to solve an error like this? And the
> > correct place.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


[OMPI devel] C/R code: opal_list_item_destruct: Assertion

2013-12-21 Thread Adrian Reber
Trying to run Open MPI with C/R enabled I get the following error
with --enable-debug:

[dcbz:20360] orte_rml_base_select: initializing rml component oob
[dcbz:20360] orte_rml_base_select: initializing rml component ftrm
[dcbz:20360] orte_rml_base_select: module ftrm unloaded
orterun: ../../opal/class/opal_list.c:69: opal_list_item_destruct: Assertion `0 
== item->opal_list_item_refcount' failed.
[dcbz:20360] *** Process received signal ***
[dcbz:20360] Signal: Aborted (6)
[dcbz:20360] Signal code:  (-6)

I fixed it like this:

diff --git a/orte/mca/rml/base/rml_base_frame.c 
b/orte/mca/rml/base/rml_base_frame.c
index 8759180..968884f 100644
--- a/orte/mca/rml/base/rml_base_frame.c
+++ b/orte/mca/rml/base/rml_base_frame.c
@@ -181,6 +181,7 @@ int orte_rml_base_select(void)
 component->rml_version.mca_component_name);

 mca_base_component_repository_release((mca_base_component_t *) 
component);
+opal_list_remove_item(&orte_rml_base_framework.framework_components, item);
 OBJ_RELEASE(item);
 }
 item = next;


Is this the correct way to solve an error like this? And the
correct place.

Adrian
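
The general rule behind both the assertion and the fix is that an item still
linked into a list has to be detached before it is released. When that happens
while walking the list, the usual idiom looks roughly like this (a sketch;
component_failed_selection() is a placeholder for the real selection logic,
and the header paths follow the files quoted in this thread):

    #include <stdbool.h>
    #include "opal/class/opal_list.h"
    #include "orte/mca/rml/base/base.h"

    static bool component_failed_selection(opal_list_item_t *item);  /* placeholder */

    static void prune_unselected_components(void)
    {
        opal_list_t *comps = &orte_rml_base_framework.framework_components;
        opal_list_item_t *item = opal_list_get_first(comps);

        while (item != opal_list_get_end(comps)) {
            /* remember the successor before the item is possibly unlinked */
            opal_list_item_t *next = opal_list_get_next(item);
            if (component_failed_selection(item)) {
                opal_list_remove_item(comps, item);   /* detach first...            */
                OBJ_RELEASE(item);                    /* ...then it is safe to free */
            }
            item = next;
        }
    }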


Re: [OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-20 Thread Adrian Reber
On Thu, Dec 19, 2013 at 09:54:19PM +0100, Adrian Reber wrote:
> This is the second try to replace the usage of blocking send and
> recv in the C/R code with the non-blocking versions. The new code
> compiles (in contrast to the old code) but does not work yet.
> This is the first step to get the C/R code working again. Right
> now it only compiles.
> 
> Changes from V1:
> * #ifdef out the broken code (so it is preserved for later re-design)
> * marked the broken C/R code with ENABLE_FT_FIXED
> 
> Changes from V2:
> * only #ifdef out parts where the behaviour actually changes
> 
> Adrian Reber (2):
>   Trying to get the C/R code to compile again. (recv_*_nb)
>   Trying to get the C/R code to compile again. (send_*_nb)

Thanks for all the reviews. All patches are now committed.

Adrian


[OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-19 Thread Adrian Reber
From: Adrian Reber <adrian.re...@hs-esslingen.de>

This is the second try to replace the usage of blocking send and
recv in the C/R code with the non-blocking versions. The new code
compiles (in contrast to the old code) but does not work yet.
This is the first step to get the C/R code working again. Right
now it only compiles.

Changes from V1:
* #ifdef out the broken code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* only #ifdef out parts where the behaviour actually changes

Adrian Reber (2):
  Trying to get the C/R code to compile again. (recv_*_nb)
  Trying to get the C/R code to compile again. (send_*_nb)

 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c|  64 +--
 orte/mca/errmgr/base/errmgr_base_tool.c |  20 +---
 orte/mca/rml/ftrm/rml_ftrm.h|  46 +---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |   4 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 139 +++-
 orte/mca/snapc/full/snapc_full_app.c|  32 +-
 orte/mca/snapc/full/snapc_full_global.c |  52 -
 orte/mca/snapc/full/snapc_full_local.c  |  40 ++-
 orte/mca/sstore/central/sstore_central_app.c|  14 ++-
 orte/mca/sstore/central/sstore_central_global.c |  21 +---
 orte/mca/sstore/central/sstore_central_local.c  |  29 ++---
 orte/mca/sstore/stage/sstore_stage_app.c|  13 ++-
 orte/mca/sstore/stage/sstore_stage_global.c |  21 +---
 orte/mca/sstore/stage/sstore_stage_local.c  |  33 +++---
 orte/tools/orte-checkpoint/orte-checkpoint.c|  20 +---
 orte/tools/orte-migrate/orte-migrate.c  |  20 +---
 16 files changed, 186 insertions(+), 382 deletions(-)

-- 
1.8.4.2



[OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber <adrian.re...@hs-esslingen.de>

This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* only #ifdef out the code where the behaviour is changed
  (used to be blocking; now non-blocking)

Signed-off-by: Adrian Reber <adrian.re...@hs-esslingen.de>
---
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 +
 orte/mca/errmgr/base/errmgr_base_tool.c | 16 +
 orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++--
 orte/mca/snapc/full/snapc_full_app.c| 12 
 orte/mca/snapc/full/snapc_full_global.c | 37 +++-
 orte/mca/snapc/full/snapc_full_local.c  | 36 +++-
 orte/mca/sstore/central/sstore_central_app.c|  6 ++
 orte/mca/sstore/central/sstore_central_global.c | 17 +-
 orte/mca/sstore/central/sstore_central_local.c  | 17 +-
 orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
 orte/mca/sstore/stage/sstore_stage_global.c | 17 +-
 orte/mca/sstore/stage/sstore_stage_local.c  | 17 +-
 orte/tools/orte-checkpoint/orte-checkpoint.c| 16 +
 orte/tools/orte-migrate/orte-migrate.c  | 16 +
 16 files changed, 87 insertions(+), 273 deletions(-)

diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
index 5d4005f..05cd598 100644
--- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
+++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
@@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void)
 ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL;
 opal_list_item_t* item = NULL;
 size_t req_size;
-int ret;

 req_size  = opal_list_get_size(&drain_msg_ack_list);
 if(req_size <= 0) {
@@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void)
 drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;

 /* Post the receive */
-if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( 
&drain_msg_ack->peer,
-
OMPI_CRCP_COORD_BOOKMARK_TAG,
-0,
-
drain_message_ack_cbfunc,
-NULL) ) ) {
-opal_output(mca_crcp_bkmrk_component.super.output_handle,
-"crcp:bkmrk: %s <-- %s: Failed to post a RML receive 
to the peer\n",
-OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
-OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
-return ret;
-}
+ompi_rte_recv_buffer_nb(&drain_msg_ack->peer, 
OMPI_CRCP_COORD_BOOKMARK_TAG,
+0, drain_message_ack_cbfunc, NULL);
 }

 return OMPI_SUCCESS;
@@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx)
 static int recv_bookmarks(int peer_idx)
 {
 ompi_process_name_t peer_name;
-int exit_status = OMPI_SUCCESS;
-int ret;

 START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R);

 peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
 peer_name.vpid   = peer_idx;

-if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name,
-OMPI_CRCP_COORD_BOOKMARK_TAG,
-0,
-recv_bookmarks_cbfunc,
-NULL) ) ) {
-opal_output(mca_crcp_bkmrk_component.super.output_handle,
-"crcp:bkmrk: recv_bookmarks: Failed to post receive 
bookmark from peer %s: Return %d\n",
-OMPI_NAME_PRINT(&peer_name),
-ret);
-exit_status = ret;
-goto cleanup;
-}
+ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
+0, recv_bookmarks_cbfunc, NULL);

 ++total_recv_bookmarks;

@@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx)
 /* JJH Doesn't make much sense to print this. The real bottleneck is 
always the send_bookmarks() */
 /*DISPLAY_INDV_TIMER(CRCP_TIMER_CKPT_EX_PEER_R, peer_idx, 1);*/

-return exit_status;
+return OMPI_SUCCESS;
 }

 static void recv_bookmarks_cbfunc(int status,
@@ -5616,6 +5594,8 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Recv the ACK msg
  */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
  
