Re: [OMPI devel] Checkpoint/restart + migration
2015-10-26 8:04 GMT+01:00 Gilles Gouaillardet:

> Federico,
>
> that looks good to me.
> The image does not show the channel between orted and its children.
> This is currently a TCP socket (v1.10) and we are moving to a Unix socket
> (already in master).

Which framework is involved in this communication? I'm not sure what this channel is used for.
Re: [OMPI devel] Checkpoint/restart + migration
Thank you guys, your help is really appreciated! We'll keep in touch for further information.

Gianmario

On 23 Oct 2015 12:44, "Jeff Squyres (jsquyres)" wrote:

> On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >
> > Gianmario,
> >
> > there was c/r support in the v1.6 series but it has been removed.
>
> To be specific: the C/R support was removed from the v2.x branch because
> it is stale / not working. The support is still in master, albeit with
> Adrian's disclaimers (it's stale / not working, but could be fixed with
> some work).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18256.php
Re: [OMPI devel] Checkpoint/restart + migration
Federico,

that looks good to me.
The image does not show the channel between orted and its children.
This is currently a TCP socket (v1.10) and we are moving to a Unix socket (already in master).

Cheers,

Gilles

On 10/26/2015 3:28 PM, Federico Reghenzani wrote:

> Hi Gilles,
>
> thank you again for your great answer. Our idea is to migrate tasks
> between nodes, possibly individually, while the other tasks keep running
> (obviously, if they want to communicate with the "migrating" node, we
> should pause them).
>
> Just to be sure we have understood correctly: is the attached image
> accurate?
>
> Cheers,
> Federico
Re: [OMPI devel] Checkpoint/restart + migration
Hi Gilles,

thank you again for your great answer. Our idea is to migrate tasks between nodes, possibly individually, while the other tasks keep running (obviously, if they want to communicate with the "migrating" node, we should pause them).

Just to be sure we have understood correctly: is the attached image accurate?

Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
Re: [OMPI devel] Checkpoint/restart + migration
Each module has the opportunity to provide an ft_event function, which is supposed to be called when a change in the module's behavior is necessary. Thus, it is relatively easy to let the BTL know that a particular destination process will migrate to a new location.

George.

On Fri, Oct 23, 2015 at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Typically, a task uses shared memory for intra node communications, and
> infiniband or tcp for inter node communications.
> So if you migrate only one task, and i assume you have no virtual shared
> memory, then you need to notify its neighbors they have to switch from shm
> to ib/tcp.
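The ft_event hook George mentions can be pictured with a small sketch. This is not Open MPI code: it is a Python toy, and every name in it (`FtState`, `ToyBtl`, `ToyNoFt`, `notify_all`, the `MIGRATE` state) is hypothetical and chosen only for illustration; the real hook's signature and state values are defined in the Open MPI sources and are not reproduced here. It shows modules that may or may not expose an `ft_event` callback, and a runtime that calls it on every module that provides one when a peer is about to move.

```python
from enum import Enum, auto


class FtState(Enum):
    """Hypothetical fault-tolerance states passed to ft_event."""
    MIGRATE = auto()   # a peer is about to move; stop sending to it
    CONTINUE = auto()  # the peer is reachable again at its new location


class ToyBtl:
    """Toy stand-in for a transport module; not the Open MPI BTL API."""

    def __init__(self, name):
        self.name = name
        self.paused_peers = set()
        self.peer_location = {}

    def ft_event(self, state, peer, new_host=None):
        if state is FtState.MIGRATE:
            # Hold traffic to the peer while it is being moved.
            self.paused_peers.add(peer)
        elif state is FtState.CONTINUE:
            # Record the new location, then resume traffic.
            self.peer_location[peer] = new_host
            self.paused_peers.discard(peer)


class ToyNoFt:
    """A module that does not provide ft_event at all."""

    def __init__(self, name):
        self.name = name


def notify_all(modules, state, peer, new_host=None):
    """Call ft_event on every module that provides one; return their names."""
    notified = []
    for module in modules:
        handler = getattr(module, "ft_event", None)
        if callable(handler):
            handler(state, peer, new_host)
            notified.append(module.name)
    return notified
```

In this toy, a migration would be bracketed by `notify_all(modules, FtState.MIGRATE, peer)` before the move and `notify_all(modules, FtState.CONTINUE, peer, new_host)` after it; modules without the hook are simply skipped.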
Re: [OMPI devel] Checkpoint/restart + migration
On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet wrote:
>
> Gianmario,
>
> there was c/r support in the v1.6 series but it has been removed.

To be specific: the C/R support was removed from the v2.x branch because it is stale / not working. The support is still in master, albeit with Adrian's disclaimers (it's stale / not working, but could be fixed with some work).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Checkpoint/restart + migration
Gianmario,

IIRC, there is one pipe between orted and each child's stderr.
stdout is a pty, and stdin is /dev/null, but it might be a pipe on task 0.
This is how stdout/stderr from the tasks end up being printed by mpirun: orted does I/O forwarding (aka IOF).

Are you trying to migrate only one task (while the other tasks keep running), or are you trying to checkpoint and restart on a different set of nodes?

Typically, a task uses shared memory for intra-node communications, and InfiniBand or TCP for inter-node communications.
So if you migrate only one task, and I assume you have no virtual shared memory, then you need to notify its neighbors that they have to switch from shm to ib/tcp.
At first glance, that is much harder than moving orted and its children: when moving orted and its children, you would "only" have to re-establish all connections and migrate the shm.
Also, orted assumes/needs its children to be running on the same node (they use a session dir in /tmp, orted waits for SIGCHLD when a child dies, ...), so if you migrate everything, you do not have to worry about that part.

You might also want to consider some virtualization: if a node is running in its own VM, or its own container with a virtual IP, you could reuse existing infrastructure, at least to migrate orted and its TCP/IP connections.

Cheers,

Gilles
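The point about neighbors having to switch from shm to ib/tcp can be sketched as follows. This is a simplified Python model, not Open MPI's BTL selection logic: the function names, the rank-to-host map and the transport labels are assumptions made for illustration, and it assumes shared memory is usable only between ranks on the same host.

```python
def pick_transport(host_of, a, b):
    """Shared memory only works when both ranks sit on the same host."""
    return "shm" if host_of[a] == host_of[b] else "tcp"


def peers_to_switch(host_of, rank, new_host):
    """Return {peer: (old, new)} for every peer whose transport to `rank`
    changes if `rank` alone moves to `new_host`."""
    after = dict(host_of)
    after[rank] = new_host
    changes = {}
    for peer in host_of:
        if peer == rank:
            continue
        old = pick_transport(host_of, rank, peer)
        new = pick_transport(after, rank, peer)
        if old != new:
            changes[peer] = (old, new)
    return changes


# Hypothetical layout: ranks 0 and 1 on nodeA, ranks 2 and 3 on nodeB.
layout = {0: "nodeA", 1: "nodeA", 2: "nodeB", 3: "nodeB"}
print(peers_to_switch(layout, 1, "nodeB"))
```

In this toy layout, moving rank 1 alone forces rank 0 off shared memory, which is the notification problem described above. When a whole node's worth of tasks moves together, co-located pairs stay co-located, so only the inter-node connections need to be re-established.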
Re: [OMPI devel] Checkpoint/restart + migration
Hi Adrian and Gilles,

first of all, thank you for your responses. I'm working with Gianmario on this ambitious project.

2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:

> Gianmario,
>
> there was c/r support in the v1.6 series but it has been removed.
> the current trend is to do application level checkpointing
> (much more efficient and much smaller checkpoint file size)
>
> iirc, ompi took care of closing/restoring all communication, and a third
> party checkpoint was required to checkpoint/restart *standalone* processes.
>
> generally speaking, mpirun and orted communicate via tcp
> orted and MPI (intra node comms) currently use tcp but we are moving to
> unix sockets
> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)

We have also seen that orted opens 2 pipes to each child; is that correct? Does orted use them to communicate with its children?

> imho, moving only one MPI task to an other node is much harder, not to say
> impossible, than moving orted and its children MPI tasks to an other node

May I ask why? I mean, if we migrate the entire orted we need to close/reopen the *mpirun-orted* and *task-task* (btl) sockets, and if we migrate a single task we need to close/reopen the *orte-task* and *task-task* sockets. In both cases we have to broadcast the "changed location" of the task or orted.

> Cheers,
>
> Gilles
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18242.php

Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
Re: [OMPI devel] Checkpoint/restart + migration
Gianmario,

there was C/R support in the v1.6 series but it has been removed.
The current trend is to do application-level checkpointing (much more efficient, and much smaller checkpoint files).

IIRC, ompi took care of closing/restoring all communication, and a third-party checkpointer was required to checkpoint/restart *standalone* processes.

Generally speaking:
- mpirun and orted communicate via TCP
- orted and MPI (intra-node comms) currently use TCP, but we are moving to Unix sockets
- MPI tasks communicate via BTL (InfiniBand, TCP, shared memory, ...)

IMHO, moving only one MPI task to another node is much harder, not to say impossible, than moving orted and its children MPI tasks to another node.

Cheers,

Gilles
Re: [OMPI devel] Checkpoint/restart + migration
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on the possibility to checkpoint a process and
> restarting it on another node. We are using CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
>
> First of all: we found out that some attempts to C/R an OMPI process have
> been already made in the past. Is anything related to that still
> supported/available/working?

I was working on the CRIU <-> OpenMPI integration during 2013/2014. The code is still available at:

https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

http://lisas.de/~adrian/?p=926

From what I have heard/read, OpenMPI has probably had enough internal changes that the Fault Tolerance framework, which is needed to use the checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon to start the checkpoint. This service daemon no longer exists due to security concerns:

https://lwn.net/Articles/658070/

So you either need to call the criu binary directly or you can use 'criu swrk'.

Restore should be easier, as criu now supports the option --inherit-fd, which should help to correctly re-route stdin/stdout/stderr.

Adrian
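Calling the criu binary directly, as suggested above, might look like the sketch below. It only builds the command lines and does not run them. The helper names, the image directory, the PID and the `--inherit-fd` resource string are placeholders, and the flags shown (`--tree`, `--images-dir`, `--inherit-fd`) should be checked against the man page of the installed CRIU version before use.

```python
def criu_dump_cmd(pid, images_dir):
    """Command line to checkpoint the process tree rooted at `pid`."""
    return ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]


def criu_restore_cmd(images_dir, inherit_fds=None):
    """Command line to restore from `images_dir`.

    `inherit_fds` maps a file descriptor number, already open in the
    process that launches criu, to the resource name recorded in the
    checkpoint images (the values used by callers here are placeholders).
    """
    cmd = ["criu", "restore", "--images-dir", images_dir]
    for fd, resource in sorted((inherit_fds or {}).items()):
        # One --inherit-fd option per descriptor to re-route on restore.
        cmd += ["--inherit-fd", f"fd[{fd}]:{resource}"]
    return cmd
```

The resulting lists could be handed to `subprocess.run(cmd, check=True)`, with `pass_fds` set so that the inherited descriptors stay open in the criu process; criu usually needs elevated privileges, and the exact resource names have to be taken from the checkpoint itself.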
[OMPI devel] Checkpoint/restart + migration
Hi everyone!

My team and I are working on the possibility of checkpointing a process and restarting it on another node. We are using the CRIU framework for the checkpoint/restart part, but we are facing some issues related to migration.

First of all: we found out that some attempts to C/R an OMPI process have already been made in the past. Is anything related to that still supported/available/working?

Then, we need to know which network communications are in use at any time, in order to "pause" them during migrations (at least the ones involving the migrating node). Our code analysis makes us think that:
- OpenMPI runtime (HNP<->orteds) uses orte/OOB
- Running applications exchange data via ompi/BTL

Is that correct? If not, can someone give us a hint?

Questions on how to update topology info may be yet to come.

Thank you guys!

Gianmario