Gianmario,

There was checkpoint/restart (C/R) support in the v1.6 series, but it has
since been removed. The current trend is application-level checkpointing,
which is much more efficient and produces much smaller checkpoint files.
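
To illustrate what application-level checkpointing means, here is a minimal
sketch in Python. It is not Open MPI code: the function names
(save_checkpoint, load_checkpoint, run), the JSON format and the toy
"sum the step numbers" workload are all made up for illustration. The idea is
that the application saves only the state it needs to resume (here a step
counter and an accumulator) instead of a full process image.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` as JSON (hypothetical helper).

    The data goes to a temporary file first and is then renamed, so a
    crash during the write never leaves a truncated checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or a copy of `default` if none exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Toy workload: accumulate 0 + 1 + ... + (total_steps - 1).

    `stop_after` simulates an interruption: the function returns early
    without saving, so work since the last checkpoint is lost.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated failure, nothing is saved here
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling run() again with the same path resumes from the last saved step
rather than from zero. In an MPI job, one would presumably have each rank
write its own file at a point where no messages are in flight (for example
right after a barrier); that part is an assumption and is not shown here.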

IIRC, Open MPI took care of closing and restoring all communication, and a
third-party checkpointer was required to checkpoint/restart the *standalone*
processes.

Generally speaking, mpirun and orted communicate via TCP.
orted and its local MPI tasks (intra-node communication) currently use TCP,
but we are moving to Unix sockets.
MPI tasks communicate with each other via a BTL (InfiniBand, TCP, shared
memory, ...).
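
As a rough sketch of what that means for migration, the Python snippet below
lists the links that could involve a given node, using only the three layers
described above. Everything in it is hypothetical: the function name
links_to_pause, the "orted@node" / "rankN" labels and the ranks_by_node
mapping are illustrative and are not Open MPI APIs. It also lists every
*possible* MPI peer; which connections actually exist at a given moment
depends on which tasks have communicated.

```python
def links_to_pause(migrating_node, ranks_by_node):
    """List (endpoint_a, endpoint_b, layer) links touching `migrating_node`.

    `ranks_by_node` maps a node name to the MPI ranks running on it.
    All names and labels here are illustrative, not Open MPI identifiers.
    """
    links = []
    local_ranks = ranks_by_node[migrating_node]
    orted = f"orted@{migrating_node}"

    # Runtime layer: mpirun talks to the orted on the node over TCP.
    links.append(("mpirun", orted, "tcp"))

    # Intra-node layer: orted talks to each of its local MPI tasks.
    for rank in local_ranks:
        links.append((orted, f"rank{rank}", "tcp (unix sockets planned)"))

    # Application layer: MPI tasks talk to each other through a BTL.
    for rank in local_ranks:
        for node, ranks in ranks_by_node.items():
            for peer in ranks:
                if node == migrating_node and peer <= rank:
                    continue  # skip self and local pairs already listed
                links.append((f"rank{rank}", f"rank{peer}", "btl"))
    return links
```

For two nodes with two ranks each, e.g. {"n0": [0, 1], "n1": [2, 3]},
migrating "n1" gives one mpirun link, two orted links and five potential
BTL links.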

IMHO, moving only one MPI task to another node is much harder (not to say
impossible) than moving orted together with its child MPI tasks to another
node.

Cheers,

Gilles

On Thursday, October 22, 2015, Gianmario Pozzi <pozzigma...@gmail.com>
wrote:

> Hi everyone!
>
> My team and I are working on the possibility of checkpointing a process and
> restarting it on another node. We are using the CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
>
> First of all: we found out that some attempts to C/R an OMPI process have
> already been made in the past. Is anything related to that still
> supported/available/working?
>
> Then, we need to know which network communications are used at any time,
> in order to "pause" them during migrations (at least the ones involving the
> migrating node). Our code analysis makes us think that:
> -OpenMPI runtime (HNP<->orteds) uses orte/OOB
> -Running applications exchange data via ompi/BTL
>
> Is that correct? If not, can someone give us a hint?
>
> Questions on how to update topology info may be yet to come.
>
> Thank you guys!
>
> Gianmario
>
