Re: [OMPI devel] 1.10.1 overnight failures - Fortran

2015-10-22 Thread Ralph Castain
Thanks!

> On Oct 22, 2015, at 6:42 PM, Gilles Gouaillardet  wrote:
> 
> Ralph,
> 
> I made PR #711 (https://github.com/open-mpi/ompi-release/pull/711) to fix
> this issue.
> 
> Cheers,
> 
> Gilles
> 
> On 10/23/2015 7:39 AM, Gilles Gouaillardet wrote:
>> Ralph,
>> 
>> These are MPI-3 functions that have not yet landed in the v1.10 series.
>> Only the MPI_Aint arithmetic functions landed in v1.10, so it seems configure
>> is confused (e.g. this test was previously not built, and now it is ...).
>> 
>> I'll try to backport the missing functions.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Friday, October 23, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>> Hi folks
>> 
>> I’m seeing a bunch of build failures in the overnight tests with this 
>> signature:
>> 
>> aint_mpifh.o: In function `do_the_test_':
>> aint_mpifh.f90:(.text+0x138): undefined reference to `mpi_win_create_dynamic_'
>> aint_mpifh.f90:(.text+0x16b): undefined reference to `mpi_win_attach_'
>> aint_mpifh.f90:(.text+0x34c): undefined reference to `mpi_win_detach_'
>> collect2: error: ld returned 1 exit status
>> 
>> 
>> Looks to me like something got left out of the prior PRs?
>> Ralph
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/10/18243.php 
>> 
>> 



Re: [OMPI devel] 1.10.1 overnight failures - Fortran

2015-10-22 Thread Gilles Gouaillardet

Ralph,

I made PR #711 (https://github.com/open-mpi/ompi-release/pull/711) to fix
this issue.


Cheers,

Gilles

On 10/23/2015 7:39 AM, Gilles Gouaillardet wrote:

Ralph,

These are MPI-3 functions that have not yet landed in the v1.10 series.
Only the MPI_Aint arithmetic functions landed in v1.10, so it seems
configure is confused
(e.g. this test was previously not built, and now it is ...).

I'll try to backport the missing functions.

Cheers,

Gilles

On Friday, October 23, 2015, Ralph Castain wrote:


Hi folks

I’m seeing a bunch of build failures in the overnight tests with
this signature:

aint_mpifh.o: In function `do_the_test_':
aint_mpifh.f90:(.text+0x138): undefined reference to `mpi_win_create_dynamic_'
aint_mpifh.f90:(.text+0x16b): undefined reference to `mpi_win_attach_'
aint_mpifh.f90:(.text+0x34c): undefined reference to `mpi_win_detach_'
collect2: error: ld returned 1 exit status


Looks to me like something got left out of the prior PRs?
Ralph





Re: [OMPI devel] 1.10.1 overnight failures - Fortran

2015-10-22 Thread Gilles Gouaillardet
Ralph,

These are MPI-3 functions that have not yet landed in the v1.10 series.
Only the MPI_Aint arithmetic functions landed in v1.10, so it seems configure
is confused
(e.g. this test was previously not built, and now it is ...).

I'll try to backport the missing functions.

Cheers,

Gilles

On Friday, October 23, 2015, Ralph Castain  wrote:

> Hi folks
>
> I’m seeing a bunch of build failures in the overnight tests with this
> signature:
>
> aint_mpifh.o: In function `do_the_test_':
> aint_mpifh.f90:(.text+0x138): undefined reference to `mpi_win_create_dynamic_'
> aint_mpifh.f90:(.text+0x16b): undefined reference to `mpi_win_attach_'
> aint_mpifh.f90:(.text+0x34c): undefined reference to `mpi_win_detach_'
> collect2: error: ld returned 1 exit status
>
>
> Looks to me like something got left out of the prior PRs?
> Ralph
>


[OMPI devel] How is session dir used?

2015-10-22 Thread Justin Cinkelj
Normally, mpirun starts an orted process on the remote node via ssh, and
orted starts mpi_program via fork+exec.
orted and mpi_program communicate via:
- environment variables (OK, that's one-time setup only, but still)
- pipes (only one, right? It is close-on-exec in the child.)
- file descriptors: mpi_program's stdin/out/err are redirected in orted
between fork and exec
- a socket (only one?): mpi_program connects to OMPI_MCA_orte_local_daemon_uri
- the session dir (OMPI_FILE_LOCATION)
(Did I miss anything? If so, how much?)
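
To make the list above concrete, here is a minimal sketch of the generic
fork+exec pattern it describes: a close-on-exec pipe, stdout redirected
between fork and exec, and one-time setup through the environment. This is
not orted's actual code. The function names and the DEMO_DAEMON_URI
variable are made up for illustration, and it assumes a POSIX system with
Python 3.4 or later.

```python
import os
import sys


def launch_child(argv, extra_env):
    """Fork, redirect the child's stdout into a pipe, then exec argv.

    Returns (pid, stdout_read_fd, status_read_fd).
    """
    out_r, out_w = os.pipe()        # will carry the child's stdout
    status_r, status_w = os.pipe()  # close-on-exec pipe for exec status
    # Python creates pipe fds non-inheritable, i.e. close-on-exec.

    pid = os.fork()
    if pid == 0:
        # Child: this code runs between fork and exec.
        try:
            os.close(out_r)
            os.close(status_r)
            os.dup2(out_w, 1)  # fd 1 now points at the pipe and survives exec
            env = dict(os.environ, **extra_env)  # one-time setup via environment
            os.execvpe(argv[0], argv, env)
        except OSError as exc:
            # exec failed: report errno to the parent over the status pipe.
            os.write(status_w, str(exc.errno).encode())
        finally:
            os._exit(127)

    # Parent: keep only the read ends.
    os.close(out_w)
    os.close(status_w)
    return pid, out_r, status_r


def run_and_collect(argv, extra_env):
    """Launch argv and return (exec_error, stdout_bytes, wait_status)."""
    pid, out_r, status_r = launch_child(argv, extra_env)
    with os.fdopen(status_r, "rb") as status, os.fdopen(out_r, "rb") as out:
        exec_error = status.read()  # b"" means the pipe was closed by a successful exec
        output = out.read()
    _, wait_status = os.waitpid(pid, 0)
    return exec_error, output, wait_status
```

Calling run_and_collect with a command that prints DEMO_DAEMON_URI returns
an empty exec_error and the value passed in extra_env. A path that does not
exist returns a non-empty exec_error instead. The parent sees end-of-file on
the status pipe only because the child's copy is closed at exec, which is
the usual reason for making such a pipe close-on-exec.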

How is the session dir used? I saw a check for an "aborted" file (so that
orted can figure out whether the child died?). Is there any other use of
that dir?

Can I just ignore it if I try to run orted on the host and mpi_program in a
container/virtual machine?



[OMPI devel] Fwd: mtt-submit, etc.

2015-10-22 Thread Howard Pritchard
Hi Folks,

I don't seem to have gotten subscribed yet to mtt-users mail list so
forwarding to the dev team.

Howard

-- Forwarded message --
From: Howard Pritchard 
Date: 2015-10-22 10:18 GMT-06:00
Subject: mtt-submit, etc.
To: mtt-us...@open-mpi.org


Hi Folks,

I have the following issue with a cluster I would like to use for
submitting MTT results for Open MPI: the nodes on which I have to submit
batch jobs to run the tests don't have external internet connectivity, so
if my mtt ini file has an IU database reporter section, the run dies in the
"ping the mtt server" test.

What I have right now is a two-stage process where I check out and
compile/build Open MPI and the tests on a front end which does have access
to the mtt server. This part works and gets reported back to the IU database.

I can run the tests using mtt, but have to disable all the mtt server
reporter stuff.

I thought I could use mtt-submit to submit some kind of mttdatabase debug
file back to IU once the batch job has completed, but I can't figure out a
way to generate this file without enabling the mtt server reporter section
in the ini file, which brings me back to the ping failure issue.

Would anyone have suggestions on how to work around this problem?

Thanks,

Howard


[OMPI devel] 1.10.1 overnight failures - Fortran

2015-10-22 Thread Ralph Castain
Hi folks

I’m seeing a bunch of build failures in the overnight tests with this signature:

aint_mpifh.o: In function `do_the_test_':
aint_mpifh.f90:(.text+0x138): undefined reference to `mpi_win_create_dynamic_'
aint_mpifh.f90:(.text+0x16b): undefined reference to `mpi_win_attach_'
aint_mpifh.f90:(.text+0x34c): undefined reference to `mpi_win_detach_'
collect2: error: ld returned 1 exit status


Looks to me like something got left out of the prior PRs?
Ralph



Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gilles Gouaillardet
Gianmario,

There was C/R support in the v1.6 series, but it has been removed.
The current trend is to do application-level checkpointing
(much more efficient, and a much smaller checkpoint file size).

IIRC, ompi took care of closing/restoring all communication, and a
third-party checkpointer was required to checkpoint/restart *standalone*
processes.

Generally speaking:
- mpirun and orted communicate via tcp
- orted and MPI tasks (intra-node comms) currently use tcp, but we are
moving to unix sockets
- MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)

IMHO, moving only one MPI task to another node is much harder, not to say
impossible, than moving orted and its child MPI tasks to another node.

Cheers,

Gilles

On Thursday, October 22, 2015, Gianmario Pozzi wrote:

> Hi everyone!
>
> My team and I are working on checkpointing a process and restarting it on
> another node. We are using the CRIU framework for the checkpoint/restart
> part, but we are facing some issues related to migration.
>
> First of all: we found out that some attempts to C/R an OMPI process have
> already been made in the past. Is anything related to that still
> supported/available/working?
>
> Then, we need to know which network communications are used at any time,
> in order to "pause" them during migrations (at least the ones involving the
> migrating node). Our code analysis makes us think that:
> -OpenMPI runtime (HNP<->orteds) uses orte/OOB
> -Running applications exchange data via ompi/BTL
>
> Is that correct? If not, can someone give us a hint?
>
> Questions on how to update topology info may be yet to come.
>
> Thank you guys!
>
> Gianmario
>


Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Adrian Reber
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on checkpointing a process and restarting it on
> another node. We are using the CRIU framework for the checkpoint/restart
> part, but we are facing some issues related to migration.
> 
> First of all: we found out that some attempts to C/R an OMPI process have
> already been made in the past. Is anything related to that still
> supported/available/working?

I was working on the CRIU <-> OpenMPI integration during 2013/2014. The
code is still available at:

https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

http://lisas.de/~adrian/?p=926

From what I have heard/read, OpenMPI has probably had enough internal
changes that the Fault Tolerance framework, which is needed to use the
checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon
to start the checkpoint. This service daemon no longer exists due to
security concerns:

https://lwn.net/Articles/658070/

So you either need to call the criu binary directly or you can use 'criu
swrk'.

Restore should be easier as criu now supports the option --inherit-fd
which should help to correctly re-route stdin/stdout/stderr.
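
As a rough illustration of calling the criu binary directly, the sketch
below only builds the command lines for a dump and a restore and does not
run them. The helper names, the images directory, the PID, and the
inherit-fd resource string are placeholders I made up. The options used
(--tree, --images-dir, --shell-job, --inherit-fd) are criu options as far as
I know, but please check them against the criu version you have installed.

```python
import shlex


def criu_dump_argv(pid, images_dir, shell_job=True):
    """Build the argv for checkpointing the process tree rooted at pid."""
    argv = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if shell_job:
        # Needed when the process was started from a shell and shares its terminal.
        argv.append("--shell-job")
    return argv


def criu_restore_argv(images_dir, inherit_fds=(), shell_job=True):
    """Build the argv for restoring from images_dir.

    inherit_fds is a sequence of (fd, resource) pairs. Each one becomes an
    --inherit-fd 'fd[N]:resource' option, so that the restored process picks
    up a descriptor already open in the process that runs criu.
    """
    argv = ["criu", "restore", "--images-dir", images_dir]
    if shell_job:
        argv.append("--shell-job")
    for fd, resource in inherit_fds:
        argv += ["--inherit-fd", f"fd[{fd}]:{resource}"]
    return argv


def as_shell(argv):
    """Render an argv list as a copy-and-paste shell command (Python 3.8+)."""
    return shlex.join(argv)
```

For example, as_shell(criu_dump_argv(1234, "/tmp/ckpt")) gives a command
that can be inspected before anything is run. The resource string passed to
--inherit-fd depends on what the checkpointed process had open, so any value
used here is only an example.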

Adrian


[OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gianmario Pozzi
Hi everyone!

My team and I are working on checkpointing a process and restarting it on
another node. We are using the CRIU framework for the checkpoint/restart
part, but we are facing some issues related to migration.

First of all: we found out that some attempts to C/R an OMPI process have
already been made in the past. Is anything related to that still
supported/available/working?

Then, we need to know which network communications are used at any time, in
order to "pause" them during migrations (at least the ones involving the
migrating node). Our code analysis makes us think that:
-OpenMPI runtime (HNP<->orteds) uses orte/OOB
-Running applications exchange data via ompi/BTL

Is that correct? If not, can someone give us a hint?

Questions on how to update topology info may be yet to come.

Thank you guys!

Gianmario