Hi Adrian and Gilles,

First of all, thank you for your responses. I'm working with Gianmario on
this ambitious project.

2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <
[email protected]>:

> Gianmario,
>
> there was c/r support in the v1.6 series but it has been removed.
> the current trend is to do application level checkpointing
> (much more efficient and much smaller checkpoint file size)
>
> iirc, ompi took care of closing/restoring all communication, and a third
> party checkpoint was required to checkpoint/restart *standalone* processes.
>
> generally speaking, mpirun and orted communicate via tcp
> orted and MPI (intra node comms) currently use tcp but we are moving to
> unix sockets
> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)
>
>
We have also seen that orted opens two pipes to each child. Is that correct?
Does orted use them to communicate with its children?
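
For reference, this is roughly how such pipes can be observed on Linux. It is
only a minimal sketch: it assumes /proc is mounted, and the helper name
list_pipes is ours (hypothetical), not part of Open MPI. It says nothing about
what the pipes are used for; it only lists them.

```python
import os

def list_pipes(pid):
    """Return {fd: pipe inode} for every pipe open in process `pid`.

    Linux only: relies on /proc/<pid>/fd symlinks of the form "pipe:[inode]".
    """
    fd_dir = f"/proc/{pid}/fd"
    pipes = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("pipe:[") and target.endswith("]"):
            pipes[int(name)] = int(target[len("pipe:["):-1])
    return pipes

if __name__ == "__main__":
    # Demo on our own process: both ends of one pipe share the same inode.
    r, w = os.pipe()
    print(list_pipes(os.getpid()))
```

Running it on the orted PID and on a child PID and comparing the inode numbers
shows which pipes the two processes share.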



> imho, moving only one MPI task to an other node is much harder, not to say
> impossible, than moving orted and its children MPI tasks to an other node
>
>
Hmm, may I ask why? If we migrate the entire orted, we need to close and
reopen the *mpirun-orted* and *task-task* (btl) sockets; if we migrate a
single task, we need to close and reopen the *orted-task* and *task-task*
sockets. In both cases we have to broadcast the new location of the task
or orted.
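
To make the comparison concrete, here is a rough sketch of how we count the
connections that cross the boundary of the migrated set. It is a toy model
under our own assumptions: the names (orted0, t0, ...) are hypothetical, every
pair of tasks is assumed to have a btl connection, and connections between
processes that move together are assumed to be restored by the C/R tool.

```python
from itertools import combinations

def build_links(nodes):
    """nodes maps each orted name to the list of tasks it launched."""
    links = []
    for orted, tasks in nodes.items():
        links.append(("mpirun-orted", "mpirun", orted))
        links.extend(("orted-task", orted, t) for t in tasks)
    all_tasks = [t for tasks in nodes.values() for t in tasks]
    # Worst case: every pair of tasks holds an open btl connection.
    links.extend(("task-task", a, b) for a, b in combinations(all_tasks, 2))
    return links

def links_to_reopen(links, moved):
    """Links with exactly one endpoint in the migrated set must be reopened."""
    return [l for l in links if (l[1] in moved) != (l[2] in moved)]

nodes = {"orted0": ["t0", "t1"], "orted1": ["t2", "t3"]}
links = build_links(nodes)

# Case 1: migrate orted1 together with its children.
whole_orted = links_to_reopen(links, {"orted1", "t2", "t3"})
# Case 2: migrate only task t3.
single_task = links_to_reopen(links, {"t3"})

print(sorted({kind for kind, _, _ in whole_orted}))  # ['mpirun-orted', 'task-task']
print(sorted({kind for kind, _, _ in single_task}))  # ['orted-task', 'task-task']
```

In this model both cases touch one runtime connection plus the task-task
connections, which is why the two options look similar to us.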



> Cheers,
>
> Gilles
>
>
> On Thursday, October 22, 2015, Gianmario Pozzi <[email protected]>
> wrote:
>
>> Hi everyone!
>>
>> My team and I are working on the possibility to checkpoint a process and
>> restarting it on another node. We are using CRIU framework for the
>> checkpoint/restart part, but we are facing some issues related to migration.
>>
>> First of all: we found out that some attempts to C/R an OMPI process have
>> been already made in the past. Is anything related to that still
>> supported/available/working?
>>
>> Then, we need to know which network communications are used at any time,
>> in order to "pause" them during migrations (at least the ones involving the
>> migrating node). Our code analysis makes us think that:
>> -OpenMPI runtime (HNP<->orteds) uses orte/OOB
>> -Running applications exchange data via ompi/BTL
>>
>> Is that correct? If not, can someone give us a hint?
>>
>> Questions on how to update topology info may be yet to come.
>>
>> Thank you guys!
>>
>> Gianmario
>>
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18242.php
>


Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
