Hi,

I am currently trying to run ParastationMPI together with dmtcp-2.4.5.
For this to work, I have to disable the DL plugin.
What problems should I expect as a result, and in which scenarios is
the DL plugin crucial?

Additionally I am trying to use hbict to achieve a form of pre-copy
live migration. In this scenario I am using MPICH 3.2 and dmtcp-2.4.5
configured with delta-compression enabled.

I have tested the following scenario with and without gzip. The result
is the same.

I am able to checkpoint my application without any problems.
Upon restart, however, I get the error messages shown below and a few
processes terminate, which causes the computation to wait endlessly for
the missing processes. It does not seem to make a difference whether I
use multiple nodes or a single node. I save the images to
/var/tmp/ckptfiles.

Restart fails with the following message:

MAX checkpoint index is 5, blknum = 198563
Create result checkpoint: .size = 1
[19532] WARNING at dmtcp_restart.cpp:300 in createProcess;
REASON='JWARNING(setsid() != -1) failed'
     getsid(0) = 19504
     (strerror((*__errno_location ()))) = Operation not permitted
Message: Failed to restore this process as session leader.
.................................
COMPLETE
...[19564] mtcp_util.ic:306 mtcp_skipfile:
  mtcp_sys_mmap() failed with error: 12...
COMPLETE
.........[19567] mtcp_util.ic:306 mtcp_skipfile:
  mtcp_sys_mmap() failed with error: 12.......
COMPLETE
.
COMPLETE
....[19571] mtcp_restart.c:963 read_one_memory_area:
  Assertion failed: 0
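
For reference, the numeric error in the mtcp_sys_mmap() lines can be
translated into a symbolic name. This is a minimal sketch in Python,
assuming the value printed by mtcp_skipfile is a standard Linux errno;
the helper name describe_errno is only for illustration and is not part
of DMTCP.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value."""
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# mtcp_sys_mmap() reported error 12 during restart (assumed to be an errno).
print(describe_errno(12))
```

If that assumption holds, error 12 is ENOMEM on Linux, so the mmap call
during restart apparently could not obtain the memory it asked for.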


The Coordinator informs me that a couple of processes have terminated:

[11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
REASON='client disconnected'
     client->identity() = 599ebe687dfd668d-47000-5804dd42
     client->progname() = lu.D.16
[11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
REASON='client disconnected'
     client->identity() = 599ebe687dfd668d-50000-5804dd42
     client->progname() = lu.D.16



How can I resolve this problem? Am I doing something wrong?

Two more general questions:

Is it possible to specify different checkpoint directories for
processes on different nodes during launch?

Is it possible to increase the number of hosts upon restart?
If I launch a computation on two nodes, I naturally get only one
dmtcp_sshd image and one hydra image for the remote node. Could I
somehow launch a second dmtcp_sshd on a third node?

Kind regards,
Moritz 


_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
