Re: [OMPI devel] Checkpoint/restart + migration

2015-11-13 Thread Federico Reghenzani
2015-10-26 8:04 GMT+01:00 Gilles Gouaillardet :

> Federico,
>
> that looks good to me.
> the image does not show the channel between orted and its children.
> this is currently a TCP socket (v1.10) and we are moving to Unix socket
> (already in master)
>
>
Which framework is involved in this communication? I'm not sure what
this channel is used for.



Re: [OMPI devel] Checkpoint/restart + migration

2015-10-27 Thread Gianmario Pozzi
Thank you guys, your help is really appreciated! We'll keep in touch for
further information.

Gianmario

On 23 Oct 2015 12:44, "Jeff Squyres (jsquyres)" wrote:

> On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> > Gianmario,
> >
> > there was c/r support in the v1.6 series but it has been removed.
>
> To be specific: the C/R support was removed from the v2.x branch because
> it is stale / not working.  The support is still in master, albeit with
> Adrian's disclaimers (it's stale / not working, but could be fixed with
> some work).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18256.php
>


Re: [OMPI devel] Checkpoint/restart + migration

2015-10-26 Thread Gilles Gouaillardet

Federico,

that looks good to me.
the image does not show the channel between orted and its children.
this is currently a TCP socket (v1.10) and we are moving to Unix 
socket (already in master)


Cheers,

Gilles


Re: [OMPI devel] Checkpoint/restart + migration

2015-10-26 Thread Federico Reghenzani
Hi Gilles,
thank you again for your great answer. Our idea is to migrate tasks
between nodes, possibly individually, while the other tasks keep running
(obviously, if they want to communicate with the "migrating" node, we
should pause them).


Just to be sure we have understood correctly: is the attached image
accurate?

Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering




Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread George Bosilca
Each module has the opportunity to provide an ft_event function, that is
supposedly called when a change in the module behavior is necessary. Thus,
it is relatively easy to let the BTL know that a particular
destination process will migrate to a new location.
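
The callback idea above can be illustrated with a small sketch. It is not
Open MPI code: the state names, the Module base class and the ToyTransport
class are all hypothetical, chosen only to show how a per-module ft_event
hook lets a transport react when a peer is about to move.

```python
from enum import Enum


class FtState(Enum):
    # Hypothetical state names, not the values Open MPI defines.
    CHECKPOINT = "checkpoint"
    CONTINUE = "continue"
    RESTART = "restart"


class Module:
    """A module may override ft_event; the default does nothing."""

    def __init__(self, name):
        self.name = name

    def ft_event(self, state):
        return None


class ToyTransport(Module):
    """Stand-in for a BTL-like module that tracks where its peers live."""

    def __init__(self):
        super().__init__("toy-transport")
        self.endpoints = {1: "nodeA", 2: "nodeB"}  # rank -> host (made up)
        self.paused = False

    def ft_event(self, state):
        if state is FtState.CHECKPOINT:
            self.paused = True          # stop sending while state is saved
        elif state is FtState.CONTINUE:
            self.paused = False         # same process, resume traffic
        elif state is FtState.RESTART:
            self.endpoints.clear()      # addresses may be stale after restart
            self.paused = False

    def peer_migrated(self, rank, new_host):
        # Record the new location of a peer that has moved.
        self.endpoints[rank] = new_host


def notify_all(modules, state):
    """Deliver one fault-tolerance event to every registered module."""
    for module in modules:
        module.ft_event(state)
```

A caller would pause with notify_all(modules, FtState.CHECKPOINT), update
the moved peer with peer_migrated(), then resume with FtState.CONTINUE.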

  George.




Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Jeff Squyres (jsquyres)
On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet 
 wrote:
> 
> Gianmario,
> 
> there was c/r support in the v1.6 series but it has been removed.

To be specific: the C/R support was removed from the v2.x branch because it is 
stale / not working.  The support is still in master, albeit with Adrian's 
disclaimers (it's stale / not working, but could be fixed with some work).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Gilles Gouaillardet
Gianmario,

Iirc, there is one pipe between orted and each child's stderr.
stdout is a pty, and stdin is /dev/null, but it might be a pipe on task 0
This is the way stdout/stderr from tasks end up being printed by mpirun : orted 
does i/o forwarding (aka IOF)
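
A minimal sketch of that forwarding pattern, for readers who want to see
the file descriptors involved. It is plain Python, not orted code, and it
assumes a pipe for stdout as well (the real daemon uses a pty there); the
function name run_and_forward is made up for this example.

```python
import subprocess
import sys


def run_and_forward(argv):
    """Spawn a child and collect its output the way a forwarding daemon might."""
    child = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,  # stdin is /dev/null, as described above
        stdout=subprocess.PIPE,    # assumption: a pipe instead of a pty
        stderr=subprocess.PIPE,    # one pipe between parent and child stderr
    )
    # communicate() drains both pipes until EOF, which avoids the deadlock
    # that reading them one after the other could cause.
    out, err = child.communicate()
    return child.returncode, out, err


if __name__ == "__main__":
    code, out, err = run_and_forward(
        [sys.executable, "-c",
         "import sys; print('hello'); print('oops', file=sys.stderr)"]
    )
    # The parent re-emits what it received, as mpirun ends up doing.
    sys.stdout.buffer.write(out)
    sys.stderr.buffer.write(err)
```

These pipes belong to the parent/child pair, so a checkpoint of a single
task has to deal with their other end staying behind on the old node.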

are you trying to migrate only one task (and other tasks still run) or are you 
trying to checkpoint and restart on a different set of nodes ?

Typically, a task uses shared memory for intra node communications, and 
infiniband or tcp for inter node communications.
So if you migrate only one task, and i assume you have no virtual shared 
memory, then you need to notify its neighbors they have to switch from shm to 
ib/tcp.
At first glance, that is much harder than moving orted and its children :
You would "only" have to re-establish all connections and migrate the shm.
Also, orted assumes/needs its children to be running on the same node (they use a 
session dir in /tmp, orted waits for SIGCHLD when a child dies, ...), so if you 
migrate everything, you do not have to worry about that part.
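
To make the shm versus ib/tcp point concrete, here is a small sketch that
works out which neighbors would have to change transport when one task
moves. The placement table, host names and the "shm"/"net" labels are
assumptions for illustration only; nothing here is an Open MPI API.

```python
def pick_transport(host_a, host_b):
    """Shared memory on the same node, a network transport otherwise."""
    return "shm" if host_a == host_b else "net"


def peers_to_notify(placement, rank, new_host):
    """Return {peer: (old, new)} for peers whose transport to `rank` changes."""
    old_host = placement[rank]
    changes = {}
    for peer, host in placement.items():
        if peer == rank:
            continue
        before = pick_transport(host, old_host)
        after = pick_transport(host, new_host)
        if before != after:
            changes[peer] = (before, after)
    return changes


if __name__ == "__main__":
    # Hypothetical job: two tasks per node, rank 1 moves from nodeA to nodeB.
    placement = {0: "nodeA", 1: "nodeA", 2: "nodeB", 3: "nodeB"}
    print(peers_to_notify(placement, 1, "nodeB"))
```

Peers that stay on a network transport are not listed, but they would
still have to reconnect to the new address; the sketch only covers the
transport switch.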

You might also want to consider some virtualization :
If a node is running in its own vm, or its own container with a virtual ip, you 
could reuse existing infrastructure at least to migrate orted and its tcp/ip 
connections

Cheers,

Gilles



Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Federico Reghenzani
Hi Adrian and Gilles,

first of all thank you for your responses. I'm working with Gianmario on
this ambitious project.

2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:

> Gianmario,
>
> there was c/r support in the v1.6 series but it has been removed.
> the current trend is to do application level checkpointing
> (much more efficient and much smaller checkpoint file size)
>
> iirc, ompi took care of closing/restoring all communication, and a third
> party checkpoint was required to checkpoint/restart *standalone* processes.
>
> generally speaking, mpirun and orted communicate via tcp
> orted and MPI (intra node comms) currently use tcp but we are moving to
> unix sockets
> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)
>
>
We have also seen that orted opens 2 pipes to each child; is that correct?
Does orted use them to communicate with its children?



> imho, moving only one MPI task to an other node is much harder, not to say
> impossible, than moving orted and its children MPI tasks to an other node
>
>
Mmm, may I ask why? I mean, if we migrate the entire orted we need to
close/reopen *mpirun-orted* and *task-task* (btl) sockets, and if we
migrate the single task we need to close/reopen *orte-task* and
*task-task* sockets.
In both cases we have to broadcast the information of "changing location"
of the task or orted.
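
The comparison can be written down as a short sketch that lists the
links to re-establish in each case. The layout dictionary and the names
orted0, t0 and so on are invented for the example, and the model only
counts links; it says nothing about shared memory segments or the session
directory.

```python
def channels_for_daemon_move(layout, daemon):
    """Links to reopen when a daemon moves together with all its tasks."""
    local = sorted(layout[daemon])
    remote = [t for d, tasks in layout.items() if d != daemon for t in tasks]
    links = [("mpirun", daemon)]                          # runtime link
    links += [(a, b) for a in local for b in remote]      # inter-node task links
    return links


def channels_for_task_move(layout, task):
    """Links to reopen when a single task leaves its daemon's node."""
    daemon = next(d for d, tasks in layout.items() if task in tasks)
    others = [t for tasks in layout.values() for t in tasks if t != task]
    return [(daemon, task)] + [(task, t) for t in others]


if __name__ == "__main__":
    layout = {"orted0": ["t0", "t1"], "orted1": ["t2", "t3"]}
    print(channels_for_daemon_move(layout, "orted0"))
    print(channels_for_task_move(layout, "t1"))
```

In this toy model the single-task case also touches the link to the
former local neighbor (t0), which previously did not cross the network.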





Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering


Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gilles Gouaillardet
Gianmario,

there was c/r support in the v1.6 series but it has been removed.
the current trend is to do application level checkpointing
(much more efficient and much smaller checkpoint file size)

iirc, ompi took care of closing/restoring all communication, and a third
party checkpoint was required to checkpoint/restart *standalone* processes.

generally speaking, mpirun and orted communicate via tcp
orted and MPI (intra node comms) currently use tcp but we are moving to
unix sockets
MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)

imho, moving only one MPI task to another node is much harder, not to say
impossible, than moving orted and its child MPI tasks to another node

Cheers,

Gilles



Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Adrian Reber
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on the possibility to checkpoint a process and
> restarting it on another node. We are using CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
> 
> First of all: we found out that some attempts to C/R an OMPI process have
> been already made in the past. Is anything related to that still
> supported/available/working?

I was working on the CRIU <-> OpenMPI integration during 2013/2014. The
code is still available at:

https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

http://lisas.de/~adrian/?p=926

From what I have heard/read, OpenMPI has probably had enough internal
changes that the Fault Tolerance framework, which is needed to use the
checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon
to start the checkpoint. This service daemon no longer exists due to
security concerns:

https://lwn.net/Articles/658070/

So you either need to call the criu binary directly or you can use 'criu
swrk'.

Restore should be easier as criu now supports the option --inherit-fd
which should help to correctly re-route stdin/stdout/stderr.
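
As a rough sketch of calling the criu binary directly, the helpers below
only build the argument lists; they do not run anything. The helper
names, the PID, the image directory and the "pipe:[12345]" resource
string are placeholders, and the exact resource string has to be taken
from the checkpointed process, so treat that part as an assumption.

```python
def criu_dump_argv(pid, images_dir, leave_running=False):
    """Build a `criu dump` command line for the process tree rooted at pid."""
    argv = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        argv.append("--leave-running")  # keep the task alive after the dump
    return argv


def criu_restore_argv(images_dir, inherited=()):
    """Build a `criu restore` command line.

    `inherited` is a sequence of (fd, resource) pairs; each one becomes an
    --inherit-fd option of the form fd[N]:resource.
    """
    argv = ["criu", "restore", "--images-dir", images_dir]
    for fd, resource in inherited:
        argv += ["--inherit-fd", "fd[{}]:{}".format(fd, resource)]
    return argv


if __name__ == "__main__":
    print(criu_dump_argv(1234, "/tmp/ckpt", leave_running=True))
    print(criu_restore_argv("/tmp/ckpt", [(3, "pipe:[12345]")]))
```

The resulting list could be handed to subprocess.run(); the descriptor
named in --inherit-fd must already be open in the process that launches
criu (for example via the pass_fds argument of subprocess).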

Adrian


[OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gianmario Pozzi
Hi everyone!

My team and I are working on the possibility to checkpoint a process and
restart it on another node. We are using the CRIU framework for the
checkpoint/restart part, but we are facing some issues related to migration.

First of all: we found out that some attempts to C/R an OMPI process have
been already made in the past. Is anything related to that still
supported/available/working?

Then, we need to know which network communications are used at any time, in
order to "pause" them during migrations (at least the ones involving the
migrating node). Our code analysis makes us think that:
-OpenMPI runtime (HNP<->orteds) uses orte/OOB
-Running applications exchange data via ompi/BTL

Is that correct? If not, can someone give us a hint?

Questions on how to update topology info may be yet to come.

Thank you guys!

Gianmario