In this case, instead of running the restart script, you can restart from the checkpoint images directly, i.e.,
dmtcp_restart -h $coordinator_addr -p $coordinator_port -j ckpt_XX.dmtcp

You will need to run this command n times, where n is the number of checkpoint images (one per process).

Best,
Jiajun

On Thu, Nov 19, 2015 at 10:35 AM, Marina Moran <esperandoelmila...@gmail.com> wrote:
> Hi Jiajun,
>
> You are right, that seems to be the reason the others don't connect.
>
> Each node writes its checkpoints to a directory that is mounted with NFS
> on the other two machines. I can see the files on both nodes. I set
> DMTCP_CHECKPOINT_DIR to that folder. I also tried writing them
> locally, but the problem is the same.
>
> Here is the output of 'ls' on the checkpoint folder on both nodes:
> node m110a (the one that connects, where the coordinator runs):
>
> hpcpro@m110a:~/nfs$ ls -l
> total 26536
> -rw------- 1 hpcpro hpcpro 5122030 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5122218 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5119200 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5119579 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755_files
> -rw------- 1 hpcpro hpcpro 2736819 Nov 19 12:31 ckpt_dmtcp_ssh_1310c955e7a-67000-564e0754.dmtcp
> -rw------- 1 hpcpro hpcpro 3899774 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754_files
> -rwxr--r-- 1 hpcpro hpcpro 12091 Nov 19 12:31 dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
> lrwxrwxrwx 1 hpcpro hpcpro 67 Nov 19 12:31 dmtcp_restart_script.sh -> /home/hpcpro/nfs/dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
> -rw-r--r-- 1 hpcpro hpcpro 254 Nov 19 12:31 jtimings.csv
>
> node m111a (the one that doesn't connect):
> hpcpro@m111a:~/nfs$ ls -l
> total 26348
> -rw------- 1 hpcpro hpcpro 5117116 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5116818 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5116096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755_files
> -rw------- 1 hpcpro hpcpro 5115075 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755_files
> -rw------- 1 hpcpro hpcpro 2736548 Nov 19 12:31 ckpt_dmtcp_sshd_1310c955fc5-69000-564e0755.dmtcp
> -rw------- 1 hpcpro hpcpro 3742556 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755_files
>
>
> I can't figure out what it could be... Thanks for your help,
> regards
> Marina
>
> On 11/18/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> > Hi Marina,
> >
> > Where are the checkpoint images stored? Are they stored in a shared file
> > system, or in local storage? From what I can tell from the log, there are 12
> > processes before checkpoint, and hence 12 checkpoint images. On restart,
> > only 6 of them connect to the coordinator. It may be that the
> > restart script couldn't find the rest of the images. Could you verify that?
> >
> > Best,
> > Jiajun
> >
> > On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran <esperandoelmila...@gmail.com> wrote:
> >
> >> Hi All! I'm back using DMTCP!
> >>
> >> I'm having a problem when restarting from a checkpoint.
> >>
> >> I have two nodes (PCs) on an Ethernet LAN, with:
> >> -Debian 8 Jessie,
> >> -DMTCP 2.4.2 (configured with -enable-timing)
> >> -OpenMPI 1.10.1.
> >>
> >> I do:
> >> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
> >>
> >> On the console where the coordinator is running, I press 'c' to
> >> checkpoint. After that, I kill the application or let it finish, and
> >> then, from the same directory where the checkpoints are stored, I run
> >> the restart script, with the following output:
> >>
> >> $ ./dmtcp_restart_script.sh
> >> [75000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
> >> (strerror((*__errno_location ()))) = Address already in use
> >> id() = 1310c955e7a-75000-564d1a58(99506)
> >> Message: Bind failed.
> >> [77000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
> >> (strerror((*__errno_location ()))) = Address already in use
> >> id() = 1310c955e7a-77000-564d1a58(99517)
> >> Message: Bind failed.
> >>
> >> On the coordinator console it outputs this:
> >>
> >> [762] NOTE at dmtcp_coordinator.cpp:1137 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
> >> numPeers = 12
> >> curTimeStamp = 23166315138
> >> compId = 1310c955e7a-66000-564d1a57
> >> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
> >> _name = restart
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-66000-564d1a57
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-67000-564d1a57
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-71000-564d1a58
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-73000-564d1a58
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-77000-564d1a58
> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c955e7a-75000-564d1a58
> >>
> >>
> >> And when I pressed 'l' to show the connected nodes:
> >> l
> >> Client List:
> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
> >> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
> >> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
> >> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
> >> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
> >> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
> >>
> >>
> >> It seems it hangs... It never ends.
> >>
> >> Hope this is something I forgot...
> >>
> >> Thanks all in advance,
> >> Regards
> >> Marina
> >>
> >> ------------------------------------------------------------------------------
> >> _______________________________________________
> >> Dmtcp-forum mailing list
> >> Dmtcp-forum@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
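
For reference, the per-image restart suggested at the top of this message could be scripted roughly as below. This is only a sketch under a few assumptions: `restart_cmds` is a hypothetical helper name, not part of DMTCP, and it only prints one `dmtcp_restart` command per `ckpt_*.dmtcp` image found in a directory instead of running anything, so the list can be checked against the expected number of processes first. The `-h`, `-p` and `-j` flags are the ones used in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print one dmtcp_restart
# command per checkpoint image found in a directory.
#   $1 = directory holding the ckpt_*.dmtcp images
#   $2 = coordinator host
#   $3 = coordinator port
restart_cmds() {
    dir=$1
    host=$2
    port=$3
    for img in "$dir"/ckpt_*.dmtcp; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$img" ] || continue
        printf 'dmtcp_restart -h %s -p %s -j %s\n' "$host" "$port" "$img"
    done
}
```

For example, `restart_cmds ~/nfs m110a 7779` would list the commands for the images visible in `~/nfs` (m110a is the coordinator node mentioned in the thread; the port is an assumed value and should be whatever the coordinator actually listens on). Each printed command would then need to be started as its own process, presumably on the node where the corresponding process was originally running.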