Federico,
that looks good to me.
the image does not show the channel between orted and its children.
This is currently a TCP socket (v1.10), and we are moving to a Unix
socket (already in master).
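For illustration, here is a minimal sketch of that kind of parent/child
channel over a Unix socket pair (plain POSIX, not the actual orted code):

    /* Sketch of a parent/child channel over a Unix socket pair
     * (plain POSIX, not actual orted code). */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        pid_t pid = fork();
        if (pid == 0) {                        /* child: uses sv[1] */
            close(sv[0]);
            const char msg[] = "hello from child";
            write(sv[1], msg, sizeof(msg));
            close(sv[1]);
            return 0;
        }
        /* parent (orted-like): uses sv[0] */
        close(sv[1]);
        char buf[64] = { 0 };
        read(sv[0], buf, sizeof(buf) - 1);
        printf("parent got: %s\n", buf);
        close(sv[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }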
Cheers,
Gilles
On 10/26/2015 3:28 PM, Federico Reghenzani wrote:
Hi Gilles,
thank you again for your great answer. Our idea is to migrate tasks
between nodes, possibly individually, while the other tasks keep
running (obviously, if they want to communicate with the "migrating"
node, we should pause them).
Just to be sure we have understood correctly, is the attached image
accurate?
Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
2015-10-23 11:45 GMT+02:00 Gilles Gouaillardet
<[email protected] <mailto:[email protected]>>:
Gianmario,
IIRC, there is one pipe between orted and each child's stderr.
stdout is a pty, and stdin is /dev/null, but it might be a pipe for
task 0.
This is the way stdout/stderr from the tasks end up being printed by
mpirun: orted does I/O forwarding (aka IOF).
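A rough sketch of that per-child wiring (plain POSIX/Linux, assumptions
about the exact setup, not orted code): stderr through a pipe, stdout
through a pty, stdin from /dev/null:

    /* Sketch of the per-child I/O wiring described above:
     * stderr -> pipe, stdout -> pty, stdin <- /dev/null. */
    #include <fcntl.h>
    #include <pty.h>              /* openpty(); link with -lutil */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int errpipe[2], master, slave;
        if (pipe(errpipe) < 0 || openpty(&master, &slave, NULL, NULL, NULL) < 0)
            return 1;

        pid_t pid = fork();
        if (pid == 0) {                        /* child (would exec the MPI task) */
            int devnull = open("/dev/null", O_RDONLY);
            dup2(devnull, STDIN_FILENO);       /* stdin  <- /dev/null        */
            dup2(slave, STDOUT_FILENO);        /* stdout -> pty slave        */
            dup2(errpipe[1], STDERR_FILENO);   /* stderr -> pipe write end   */
            close(errpipe[0]); close(master);
            fprintf(stderr, "stderr goes through the pipe\n");
            printf("stdout goes through the pty\n");
            return 0;
        }
        /* parent (orted-like): read the child's stderr from the pipe and
         * forward it -- roughly what IOF does. */
        close(slave); close(errpipe[1]);
        char buf[128]; ssize_t n;
        while ((n = read(errpipe[0], buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, n);
        return 0;
    }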
Are you trying to migrate only one task (while the other tasks keep
running), or are you trying to checkpoint and restart on a different
set of nodes?
Typically, a task uses shared memory for intra-node
communications, and infiniband or tcp for inter-node communications.
So if you migrate only one task, and I assume you have no virtual
shared memory, then you need to notify its neighbors that they have to
switch from shm to ib/tcp.
At first glance, that is much harder than moving orted and its
children: there you would "only" have to re-establish all connections
and migrate the shm.
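To illustrate what "migrating the shm" would mean at the POSIX level
(just a sketch; the segment name /my_seg and the dump path are made up,
and error checking is omitted -- the real OMPI backing files live under
the session dir):

    /* Sketch: dump a POSIX shared-memory segment to a file so it could
     * be recreated on the destination node. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int shm = shm_open("/my_seg", O_RDONLY, 0);       /* existing segment */
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, shm, 0);

        int out = open("/tmp/my_seg.dump", O_CREAT | O_WRONLY, 0600);
        write(out, p, len);                               /* copy contents out */

        /* on the destination node one would shm_open(O_CREAT), ftruncate,
         * mmap and copy the dump back in, then re-point the tasks at it */
        close(out); munmap(p, len); close(shm);
        return 0;
    }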
Also, orted assumes/needs its children to be running on the same
node (they use a session dir in /tmp, orted waits for SIGCHLD when
a child dies, ...), so if you migrate everything, you do not have
to worry about that part.
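That local-children assumption is basically the usual SIGCHLD/waitpid
pattern (a generic sketch, not the actual orted code):

    /* Generic sketch of the SIGCHLD/waitpid pattern a local daemon relies
     * on: it only works because the children are direct descendants
     * running on the same node. */
    #include <signal.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_sigchld(int sig)
    {
        (void)sig;
        int status;
        /* reap every child that has exited (WNOHANG: do not block) */
        while (waitpid(-1, &status, WNOHANG) > 0)
            ;   /* here the daemon would mark the proc as terminated */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigaction(SIGCHLD, &sa, NULL);
        /* ... fork/exec the local children, then run the event loop ... */
        pause();
        return 0;
    }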
You might also want to consider some virtualization:
if a node is running in its own VM, or in its own container with a
virtual IP, you could reuse existing infrastructure, at least to
migrate orted and its TCP/IP connections.
Cheers,
Gilles
Federico Reghenzani <[email protected]
<mailto:[email protected]>> wrote:
Hi Adrian and Gilles,
first of all thank you for your responses. I'm working with
Gianmario on this ambitious project.
2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet
<[email protected]>:
Gianmario,
there was C/R support in the v1.6 series, but it has been removed.
The current trend is to do application-level checkpointing
(much more efficient and much smaller checkpoint file size).
IIRC, OMPI took care of closing/restoring all communication,
and a third-party checkpointer was required to
checkpoint/restart *standalone* processes.
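As an illustration, application-level checkpointing typically looks
something like this (just a sketch; the state layout and file names are
made up):

    /* Sketch of application-level checkpointing: each rank periodically
     * dumps its own state to a per-rank file and reloads it at restart. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double state[1024] = { 0 };          /* the application's own data */
        char path[64];
        snprintf(path, sizeof(path), "ckpt.%d", rank);

        /* restart: reload the state if a checkpoint already exists */
        FILE *f = fopen(path, "rb");
        if (f) { fread(state, sizeof(state), 1, f); fclose(f); }

        for (int step = 0; step < 100; ++step) {
            /* ... compute, communicate ... */
            if (step % 10 == 0) {            /* checkpoint every 10 steps */
                f = fopen(path, "wb");
                fwrite(state, sizeof(state), 1, f);
                fclose(f);
            }
        }
        MPI_Finalize();
        return 0;
    }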
Generally speaking:
- mpirun and orted communicate via TCP
- orted and the MPI tasks (intra-node comms) currently use TCP, but we
are moving to Unix sockets
- MPI tasks communicate with each other via the BTLs (infiniband, tcp,
shared memory, ...)
We have also seen that orted opens 2 pipes to each child, is that
correct? Does orted use them to communicate with its children?
IMHO, moving only one MPI task to another node is much
harder, not to say impossible, than moving orted and its
children MPI tasks to another node.
Mmm, may I ask why? I mean, if we migrate the entire orted we
need to close/reopen the mpirun-orted and task-task (BTL) sockets,
and if we migrate a single task we need to close/reopen the
orted-task and task-task sockets. In both cases we have to
broadcast the information that the task or orted has changed location.
Cheers,
Gilles
On Thursday, October 22, 2015, Gianmario Pozzi
<[email protected] <mailto:[email protected]>> wrote:
Hi everyone!
My team and I are working on the possibility of checkpointing
a process and restarting it on another node. We are using the
CRIU framework for the checkpoint/restart part, but we are
facing some issues related to migration.
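For context, the flow we have in mind on each node is roughly the
following (a sketch that just shells out to the criu CLI; the PID and
the image directory are placeholders):

    /* Sketch of driving CRIU from C by shelling out to the CLI
     * (the PID and image directory below are placeholders). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int pid = 12345;                 /* pid of the process tree to dump */
        char cmd[256];

        /* checkpoint: dump the process tree into an images directory */
        snprintf(cmd, sizeof(cmd),
                 "criu dump -t %d -D /tmp/ckpt --shell-job --leave-running",
                 pid);
        if (system(cmd) != 0) { fprintf(stderr, "dump failed\n"); return 1; }

        /* ... copy /tmp/ckpt to the destination node ... */

        /* restore: recreate the process tree from the images */
        if (system("criu restore -D /tmp/ckpt --shell-job") != 0) {
            fprintf(stderr, "restore failed\n");
            return 1;
        }
        return 0;
    }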
First of all: we found out that some attempts to C/R an
OMPI process have already been made in the past. Is
anything related to that still supported/available/working?
Then, we need to know which network communications are
used at any time, in order to "pause" them during
migrations (at least the ones involving the migrating
node). Our code analysis makes us think that:
- OpenMPI runtime (HNP <-> orteds) uses orte/OOB
- Running applications exchange data via ompi/BTL
Is that correct? If not, can someone give us a hint?
Questions on how to update topology info may be yet to come.
Thank you guys!
Gianmario
Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering