Hi again,

Sorry to ask about this again, but I really need to get the restart working, and I don't know where to look. I think I am doing the right things, but it doesn't work.
Any advice would be much appreciated!

Best regards,
Marina

On 11/26/15, Marina Moran <esperandoelmila...@gmail.com> wrote:
> Hi Jiajun:
>
> I tried what you told me, here is the output:
>
> $ dmtcp_restart -h 10.0.2.20 -p 7779 -j ckpt_app_heat_512_1310c955e7a-53000-56578f4d.dmtcp
> [1131] ERROR at dmtcp_restart.cpp:760 in main; REASON='JASSERT(independentProcessTreeRoots.size() > 0) failed'
> Message: There must be at least one process tree that doesn't have
> a different process as session leader.
> dmtcp_restart (1131): Terminating...
>
> I also tried without the -h and -p options (because I set the
> variable DMTCP_HOST), and I tried all the .dmtcp files (on both
> nodes); they all give the same error.
>
> Am I missing something perhaps?
>
> Thanks again for all your help!
> Regards,
> Marina
>
> On 11/20/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>> In this case, instead of running the restart script, you can restart from
>> the checkpoint images directly, i.e.,
>>
>> dmtcp_restart -h $coordinator_addr -p $coordinator_port -j ckpt_XX.dmtcp
>>
>> You will need to run this command n times, where n is the number of
>> checkpoint images (number of processes).
>>
>> Best,
>> Jiajun
>>
>> On Thu, Nov 19, 2015 at 10:35 AM, Marina Moran
>> <esperandoelmila...@gmail.com> wrote:
>>
>>> Hi Jiajun,
>>>
>>> You are right, that seems to be the reason: they don't connect.
>>>
>>> Each node writes the checkpoint to a directory that is mounted with NFS
>>> on the other two machines. I can see the files on both nodes. I set
>>> DMTCP_CHECKPOINT_DIR to that folder. I also tried writing them
>>> locally, but the problem is the same.
>>>
>>> I copy the output of the 'ls' command on the checkpoint folder on both
>>> nodes:
>>>
>>> node m110a (the one that connects, where the coordinator runs):
>>>
>>> hpcpro@m110a:~/nfs$ ls -l
>>> total 26536
>>> -rw------- 1 hpcpro hpcpro 5122030 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5122218 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5119200 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5119579 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 2736819 Nov 19 12:31 ckpt_dmtcp_ssh_1310c955e7a-67000-564e0754.dmtcp
>>> -rw------- 1 hpcpro hpcpro 3899774 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754_files
>>> -rwxr--r-- 1 hpcpro hpcpro   12091 Nov 19 12:31 dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>>> lrwxrwxrwx 1 hpcpro hpcpro      67 Nov 19 12:31 dmtcp_restart_script.sh -> /home/hpcpro/nfs/dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>>> -rw-r--r-- 1 hpcpro hpcpro     254 Nov 19 12:31 jtimings.csv
>>>
>>> node m111a (the one that doesn't connect):
>>>
>>> hpcpro@m111a:~/nfs$ ls -l
>>> total 26348
>>> -rw------- 1 hpcpro hpcpro 5117116 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5116818 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5116096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5115075 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 2736548 Nov 19 12:31 ckpt_dmtcp_sshd_1310c955fc5-69000-564e0755.dmtcp
>>> -rw------- 1 hpcpro hpcpro 3742556 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755_files
>>>
>>> I can't figure out what it can be... Thanks for your help,
>>> regards
>>> Marina
>>>
>>> On 11/18/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>>> > Hi Marina,
>>> >
>>> > Where are checkpoint images stored? Are they stored in a shared file
>>> > system, or in local storage? From what I can tell from the log,
>>> > there are 12 processes before checkpoint, and hence 12 checkpoint
>>> > images. On restart, only 6 of them connect to the coordinator. It may
>>> > be that the restart script couldn't find the rest of the images.
>>> > Could you verify that?
>>> >
>>> > Best,
>>> > Jiajun
>>> >
>>> > On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran
>>> > <esperandoelmila...@gmail.com> wrote:
>>> >
>>> >> Hi All! I'm back using DMTCP!
>>> >>
>>> >> I'm having a problem when restarting a checkpoint.
>>> >>
>>> >> I have two nodes (PCs) in an Ethernet LAN, with:
>>> >> -Debian 8 Jessie,
>>> >> -DMTCP 2.4.2 (configured with -enable-timing)
>>> >> -OpenMPI 1.10.1.
>>> >>
>>> >> I do:
>>> >> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
>>> >>
>>> >> On the console where the coordinator is running, I press 'c' to
>>> >> checkpoint. After that, I kill the application or let it finish, and
>>> >> then, from the same directory where the checkpoints are stored, I run
>>> >> the restart script, with the following output:
>>> >>
>>> >> $ ./dmtcp_restart_script.sh
>>> >> [75000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
>>> >>      (strerror((*__errno_location ()))) = Address already in use
>>> >>      id() = 1310c955e7a-75000-564d1a58(99506)
>>> >> Message: Bind failed.
>>> >> [77000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
>>> >>      (strerror((*__errno_location ()))) = Address already in use
>>> >>      id() = 1310c955e7a-77000-564d1a58(99517)
>>> >> Message: Bind failed.
>>> >>
>>> >> On the coordinator console it outputs this:
>>> >>
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1137 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
>>> >>      numPeers = 12
>>> >>      curTimeStamp = 23166315138
>>> >>      compId = 1310c955e7a-66000-564d1a57
>>> >> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
>>> >>      _name = restart
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-66000-564d1a57
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-67000-564d1a57
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-71000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-73000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-77000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> >>      hello_remote.from = 1310c955e7a-75000-564d1a58
>>> >>
>>> >> And when I pressed 'l' to show the connected nodes:
>>> >> l
>>> >> Client List:
>>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>>> >> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
>>> >> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
>>> >> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
>>> >> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
>>> >> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
>>> >> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
>>> >>
>>> >> It seems to hang... It never ends.
>>> >>
>>> >> Hope this is something I forgot...
>>> >>
>>> >> Thanks all in advance,
>>> >> Regards
>>> >> Marina
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >> _______________________________________________
>>> >> Dmtcp-forum mailing list
>>> >> Dmtcp-forum@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
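
Two steps from the thread above lend themselves to a small shell sketch: checking how many checkpoint images are visible in a directory (Jiajun's question about the 12 images), and running dmtcp_restart once per image with -h, -p and -j (his later suggestion). The helper names count_images and print_restart_commands are hypothetical and not part of DMTCP. The script only prints the commands, it does not run them, and it uses only the options that already appear in the thread. It is an illustration under those assumptions and does not address the independentProcessTreeRoots error reported above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of DMTCP.

# Print how many ckpt_*.dmtcp images are visible in directory $1.
count_images() {
    dir=$1
    n=0
    for img in "$dir"/ckpt_*.dmtcp; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$img" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

# Print one dmtcp_restart command per image in directory $1,
# joining the coordinator at host $2 and port $3 (flags as used in the thread).
print_restart_commands() {
    dir=$1
    host=$2
    port=$3
    for img in "$dir"/ckpt_*.dmtcp; do
        [ -f "$img" ] || continue
        echo "dmtcp_restart -h $host -p $port -j $img"
    done
}

# Example (commented out so that sourcing this file has no side effects):
#   count_images "$HOME/nfs"
#   print_restart_commands "$HOME/nfs" 10.0.2.20 7779
```

If count_images reports fewer images than the numPeers value the coordinator prints (12 in the log above), some images are not visible from that node. Printing the commands first, rather than executing them, makes it easy to review which image would be restarted on which node.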