Federico,
that looks good to me.
the image does not show the channel between orted and its children.
This is currently a TCP socket (v1.10), and we are moving to a Unix
socket (already in master).
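For illustration, here is a minimal sketch of that kind of parent/child
channel over a Unix socket pair (plain POSIX, not the actual orted code):

    /* Sketch of a parent/child channel over a Unix socket pair
     * (plain POSIX, not actual orted code). */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        pid_t pid = fork();
        if (pid == 0) {                        /* child: uses sv[1] */
            close(sv[0]);
            const char msg[] = "hello from child";
            write(sv[1], msg, sizeof(msg));
            close(sv[1]);
            return 0;
        }
        /* parent (orted-like): uses sv[0] */
        close(sv[1]);
        char buf[64] = { 0 };
        read(sv[0], buf, sizeof(buf) - 1);
        printf("parent got: %s\n", buf);
        close(sv[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }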
Cheers,
Gilles
On 10/26/2015 3:28 PM, Federico Reghenzani wrote:
Hi Gilles,
thank you again for your great answer. Our idea is to migrate tasks
between nodes, possibly individually, while the other tasks keep
running (obviously, if they want to communicate with the "migrating"
node, we should pause them).
Just to be sure we have understood correctly, is the attached image
accurate?
Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
2015-10-23 11:45 GMT+02:00 Gilles Gouaillardet
<[email protected] <mailto:[email protected]>>:
Gianmario,
IIRC, there is one pipe between orted and each child's stderr.
stdout is a pty, and stdin is /dev/null, but it might be a pipe for
task 0.
This is the way stdout/stderr from the tasks end up being printed by
mpirun: orted does I/O forwarding (aka IOF).
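A rough sketch of that per-child wiring (plain POSIX/Linux, assumptions
about the exact setup, not orted code): stderr through a pipe, stdout
through a pty, stdin from /dev/null:

    /* Sketch of the per-child I/O wiring described above:
     * stderr -> pipe, stdout -> pty, stdin <- /dev/null. */
    #include <fcntl.h>
    #include <pty.h>              /* openpty(); link with -lutil */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int errpipe[2], master, slave;
        if (pipe(errpipe) < 0 || openpty(&master, &slave, NULL, NULL, NULL) < 0)
            return 1;

        pid_t pid = fork();
        if (pid == 0) {                        /* child (would exec the MPI task) */
            int devnull = open("/dev/null", O_RDONLY);
            dup2(devnull, STDIN_FILENO);       /* stdin  <- /dev/null        */
            dup2(slave, STDOUT_FILENO);        /* stdout -> pty slave        */
            dup2(errpipe[1], STDERR_FILENO);   /* stderr -> pipe write end   */
            close(errpipe[0]); close(master);
            fprintf(stderr, "stderr goes through the pipe\n");
            printf("stdout goes through the pty\n");
            return 0;
        }
        /* parent (orted-like): read the child's stderr from the pipe and
         * forward it -- roughly what IOF does. */
        close(slave); close(errpipe[1]);
        char buf[128]; ssize_t n;
        while ((n = read(errpipe[0], buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, n);
        return 0;
    }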
Are you trying to migrate only one task (while the other tasks keep
running), or are you trying to checkpoint and restart on a different
set of nodes?
Typically, a task uses shared memory for intra-node
communications, and infiniband or tcp for inter-node communications.
So if you migrate only one task, and I assume you have no virtual
shared memory, then you need to notify its neighbors that they have to
switch from shm to ib/tcp.
At first glance, that is much harder than moving orted and its
children: there you would "only" have to re-establish all connections
and migrate the shm.
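To illustrate what "migrating the shm" would mean at the POSIX level
(just a sketch; the segment name /my_seg and the dump path are made up,
and error checking is omitted -- the real OMPI backing files live under
the session dir):

    /* Sketch: dump a POSIX shared-memory segment to a file so it could
     * be recreated on the destination node. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int shm = shm_open("/my_seg", O_RDONLY, 0);       /* existing segment */
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, shm, 0);

        int out = open("/tmp/my_seg.dump", O_CREAT | O_WRONLY, 0600);
        write(out, p, len);                               /* copy contents out */

        /* on the destination node one would shm_open(O_CREAT), ftruncate,
         * mmap and copy the dump back in, then re-point the tasks at it */
        close(out); munmap(p, len); close(shm);
        return 0;
    }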
Also, orted assumes/needs its children to be running on the same
node (they use a session dir in /tmp, orted waits for SIGCHLD when
a child dies, ...), so if you migrate everything, you do not have
to worry about that part.
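That local-children assumption is basically the usual SIGCHLD/waitpid
pattern (a generic sketch, not the actual orted code):

    /* Generic sketch of the SIGCHLD/waitpid pattern a local daemon relies
     * on: it only works because the children are direct descendants
     * running on the same node. */
    #include <signal.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_sigchld(int sig)
    {
        (void)sig;
        int status;
        /* reap every child that has exited (WNOHANG: do not block) */
        while (waitpid(-1, &status, WNOHANG) > 0)
            ;   /* here the daemon would mark the proc as terminated */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigaction(SIGCHLD, &sa, NULL);
        /* ... fork/exec the local children, then run the event loop ... */
        pause();
        return 0;
    }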
You might also want to consider some virtualization:
if a node is running in its own VM, or in its own container with a
virtual IP, you could reuse existing infrastructure, at least to
migrate orted and its TCP/IP connections.
Cheers,
Gilles
Federico Reghenzani <[email protected]
<mailto:[email protected]>> wrote:
Hi Adrian and Gilles,
first of all thank you for your responses. I'm working with
Gianmario on this ambitious project.
2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet
<[email protected]>:
Gianmario,
there was C/R support in the v1.6 series, but it has been removed.
The current trend is to do application-level checkpointing
(much more efficient and much smaller checkpoint file size).
IIRC, OMPI took care of closing/restoring all communication,
and a third-party checkpointer was required to
checkpoint/restart *standalone* processes.
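As an illustration, application-level checkpointing typically looks
something like this (just a sketch; the state layout and file names are
made up):

    /* Sketch of application-level checkpointing: each rank periodically
     * dumps its own state to a per-rank file and reloads it at restart. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double state[1024] = { 0 };          /* the application's own data */
        char path[64];
        snprintf(path, sizeof(path), "ckpt.%d", rank);

        /* restart: reload the state if a checkpoint already exists */
        FILE *f = fopen(path, "rb");
        if (f) { fread(state, sizeof(state), 1, f); fclose(f); }

        for (int step = 0; step < 100; ++step) {
            /* ... compute, communicate ... */
            if (step % 10 == 0) {            /* checkpoint every 10 steps */
                f = fopen(path, "wb");
                fwrite(state, sizeof(state), 1, f);
                fclose(f);
            }
        }
        MPI_Finalize();
        return 0;
    }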
Generally speaking:
- mpirun and orted communicate via TCP
- orted and the MPI tasks (intra-node comms) currently use TCP, but we
are moving to Unix sockets
- MPI tasks communicate with each other via the BTLs (infiniband, tcp,
shared memory, ...)
We have also seen that orted opens 2 pipes to each child, is that
correct? Does orted use them to communicate with its children?
IMHO, moving only one MPI task to another node is much
harder, not to say impossible, than moving orted and its
children MPI tasks to another node.
Mmm, may I ask why? I mean, if we migrate the entire orted we
need to close/reopen the mpirun-orted and task-task (BTL) sockets,
and if we migrate a single task we need to close/reopen the
orted-task and task-task sockets. In both cases we have to
broadcast the information that the task or orted has changed location.
Cheers,
Gilles
On Thursday, October 22, 2015, Gianmario Pozzi
<[email protected] <mailto:[email protected]>> wrote:
Hi everyone!
My team and I are working on the possibility of checkpointing
a process and restarting it on another node. We are using the
CRIU framework for the checkpoint/restart part, but we are
facing some issues related to migration.
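For context, the flow we have in mind on each node is roughly the
following (a sketch that just shells out to the criu CLI; the PID and
the image directory are placeholders):

    /* Sketch of driving CRIU from C by shelling out to the CLI
     * (the PID and image directory below are placeholders). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int pid = 12345;                 /* pid of the process tree to dump */
        char cmd[256];

        /* checkpoint: dump the process tree into an images directory */
        snprintf(cmd, sizeof(cmd),
                 "criu dump -t %d -D /tmp/ckpt --shell-job --leave-running",
                 pid);
        if (system(cmd) != 0) { fprintf(stderr, "dump failed\n"); return 1; }

        /* ... copy /tmp/ckpt to the destination node ... */

        /* restore: recreate the process tree from the images */
        if (system("criu restore -D /tmp/ckpt --shell-job") != 0) {
            fprintf(stderr, "restore failed\n");
            return 1;
        }
        return 0;
    }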
First of all: we found out that some attempts to C/R an
OMPI process have already been made in the past. Is
anything related to that still supported/available/working?
Then, we need to know which network communications are
used at any time, in order to "pause" them during
migrations (at least the ones involving the migrating
node). Our code analysis makes us think that:
- OpenMPI runtime (HNP <-> orteds) uses orte/OOB
- Running applications exchange data via ompi/BTL
Is that correct? If not, can someone give us a hint?
Questions on how to update topology info may be yet to come.
Thank you guys!
Gianmario
Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering