Thanks for the reply, and don't worry about the delay. Yeah, I suppose it wouldn't be easy :(. But my final goal is what you are mentioning: to stop one particular process (previously checkpointed), migrate it to another place (node, core, slot, etc.), and restart it there, but without making a coordinated checkpoint. I just need to checkpoint processes in an uncoordinated way and move them.
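To make the goal concrete, here is a toy sketch of what I mean. It is not Open MPI code and does not use any ORTE structures; every name in it (ToyProcess, ToyRuntime, migrate, the node names) is made up for illustration. It only shows the idea: one process is snapshotted on its own, stopped, restarted from that image somewhere else, and the peers keep addressing it by the same rank.

```python
import pickle


class ToyProcess:
    """Hypothetical stand-in for one rank; holds only application state."""

    def __init__(self, rank, state=None):
        self.rank = rank
        self.state = state if state is not None else {"step": 0, "inbox": []}

    def receive(self, msg):
        self.state["inbox"].append(msg)
        self.state["step"] += 1

    def checkpoint(self):
        # Uncoordinated: only this process is snapshotted, no peer takes part.
        return pickle.dumps((self.rank, self.state))

    @classmethod
    def restart(cls, image):
        rank, state = pickle.loads(image)
        return cls(rank, state)


class ToyRuntime:
    """Hypothetical runtime: maps logical ranks to a node and a process."""

    def __init__(self):
        self.location = {}  # rank -> node name
        self.procs = {}     # rank -> ToyProcess

    def launch(self, rank, node):
        self.location[rank] = node
        self.procs[rank] = ToyProcess(rank)

    def send(self, dst_rank, msg):
        # Senders only know the logical rank, never the node.
        self.procs[dst_rank].receive(msg)

    def migrate(self, rank, new_node):
        image = self.procs[rank].checkpoint()         # snapshot one process
        del self.procs[rank]                          # stop it at the old place
        self.procs[rank] = ToyProcess.restart(image)  # restart from the image
        self.location[rank] = new_node                # update rank -> node map


rt = ToyRuntime()
rt.launch(1, "node-a")
rt.launch(2, "node-b")
rt.send(2, "hello")
rt.migrate(2, "node-c")
rt.send(2, "still rank 2")
```

The sketch ignores messages that are in flight while the process is stopped. As far as I understand, a real uncoordinated scheme would need something like message logging to handle those, which is part of what I am trying to figure out.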
Where can I see something about process migration in the code? Or something that could guide me.

Greetings.

Hugo Meyer

2011/1/6 Jeff Squyres <jsquy...@cisco.com>

> Sorry for the delay; you wrote while many of us were on vacation and we're
> just now starting to catch up on past mails...
>
> I'm not entirely sure what you're trying to do. It sounds like you're
> trying to replace one process with another. That's quite complicated; there
> will be a lot of changes required in the code base to do this.
>
> - you'll need to notify the ORTE subsystem of the process change
> - this notification will likely need to span multiple processes
> - all MPI processes will need to quiesce their communications, disconnect,
>   and reconnect
> - ...and probably other things
>
> That being said, you might be able to leverage some of the work that's been
> done with checkpoint/restart/migration. It's not entirely the same thing
> that you're doing, but it's at least similar (quiesce networks, [pretend to]
> move a process from location A to location B, etc.).
>
> On Dec 28, 2010, at 7:03 AM, Hugo Meyer wrote:
>
> > Hello to all.
> >
> > I'm new in the forum, at least is the first time i write.
> >
> > I'm working with open mpi and I would do a little experiment, i will try
> > to pass one process by another process.
> >
> > For example, assuming that there are 2 processes that are communicating
> > say rank 1 and 2. And there is a process of rank 3, I would like the rank 3
> > (it could be assumed that this node is marked down at the initial hostfile)
> > took the place of rank 2, and rank 1 still think that he is communicating
> > with rank 2 when in fact is communicating with the rank 3.
> >
> > I guess I'll have to modify tables as orte_job_map_t and orte_proc_t, but
> > I wanted to know if someone already has experience doing something similar,
> > and can guide me at least.
> >
> > The communication between processes, in principle, would be irrelevant,
> > so i will not need to use checkpoints / restarts for now.
> >
> > Greetings
> >
> > Hugo Meyer
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/