Hi,

I am currently trying to run ParastationMPI together with dmtcp-2.4.5. For this to work I have to disable the DL plugin. What kinds of problems should I expect, and in which scenarios is the DL plugin crucial?

Additionally, I am trying to use hbict to achieve a form of pre-copy live migration. In this scenario I am using MPICH 3.2 and dmtcp-2.4.5 configured with delta compression enabled. I have tested the following scenario with and without gzip; the result is the same.

I am able to checkpoint my application without any problems. Upon restart I get the error message below and a few processes terminate, which causes the computation to wait endlessly for the missing processes. It does not seem to make a difference whether I use multiple nodes or a single node. I save the images to /var/tmp/ckptfiles.

Restart fails with the following message:

    MAX checkpoint index is 5, blknum = 198563
    Create result checkpoint: .size = 1
    [19532] WARNING at dmtcp_restart.cpp:300 in createProcess; REASON='JWARNING(setsid() != -1) failed'
         getsid(0) = 19504
         (strerror((*__errno_location ()))) = Operation not permitted
    Message: Failed to restore this process as session leader.
    ................................. COMPLETE
    ...[19564] mtcp_util.ic:306 mtcp_skipfile: mtcp_sys_mmap() failed with error: 12... COMPLETE
    .........[19567] mtcp_util.ic:306 mtcp_skipfile: mtcp_sys_mmap() failed with error: 12....... COMPLETE
    . COMPLETE
    ....[19571] mtcp_restart.c:963 read_one_memory_area: Assertion failed: 0

The coordinator informs me that a couple of processes have terminated:

    [11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
         client->identity() = 599ebe687dfd668d-47000-5804dd42
         client->progname() = lu.D.16
    [11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
         client->identity() = 599ebe687dfd668d-50000-5804dd42
         client->progname() = lu.D.16

How can I resolve this problem? Am I doing something wrong?

Two more general questions:

Is it possible to specify different checkpoint directories for processes on different nodes during launch?

Is it possible to increase the number of hosts upon restart?
If I launch a computation on two nodes, I naturally only get one dmtcp_sshd image and one hydra image for the remote node. Could I somehow launch a second dmtcp_sshd on a third node?

Kind regards,
Moritz
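
Note on the restart log above: the mtcp_skipfile lines report "mtcp_sys_mmap() failed with error: 12". Assuming that number is a raw Linux errno value (the log itself does not say so), it can be decoded with a few lines of Python. On Linux, errno 12 is ENOMEM, which would mean the mmap() call could not obtain memory or address space. The helper name describe_errno is made up for this illustration and is not part of DMTCP.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a raw errno number on the current platform.

    Hypothetical helper for reading log output; not part of DMTCP.
    """
    # errno.errorcode maps numeric values to symbolic names such as 'ENOMEM'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message for the code.
    return f"{name}: {os.strerror(code)}"


# Assumption: the value printed by mtcp_skipfile is a plain errno number.
print(describe_errno(12))
```

The same helper applied to errno.EPERM gives the "Operation not permitted" text that appears in the setsid() warning.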