Hi Jiajun:

I tried what you suggested; here is the output:

$ dmtcp_restart -h 10.0.2.20 -p 7779 -j ckpt_app_heat_512_1310c955e7a-53000-56578f4d.dmtcp
[1131] ERROR at dmtcp_restart.cpp:760 in main;
REASON='JASSERT(independentProcessTreeRoots.size() > 0) failed'
Message: There must be at least one process tree that doesn't have
  a different process as session leader.
dmtcp_restart (1131): Terminating...


I also tried without the -h and -p options (because I set the
variable DMTCP_HOST) and with each of the .dmtcp files (on both
nodes), and they all give the same error.
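
For anyone reproducing this, the "one dmtcp_restart per image" procedure could be scripted roughly as below. This is only a sketch: it uses just the -h, -p and -j options that appear in the command above, and the function name print_restart_cmds and the variables COORD_HOST, COORD_PORT and CKPT_DIR are illustrative assumptions, not part of DMTCP. It prints the commands instead of running them, so each one can be checked first.

```shell
#!/bin/sh
# Dry run: print one dmtcp_restart command per checkpoint image in a directory.
# Only -h, -p and -j (taken from the command shown in this thread) are used.
# COORD_HOST, COORD_PORT and CKPT_DIR are illustrative names (assumptions).
COORD_HOST="${COORD_HOST:-10.0.2.20}"
COORD_PORT="${COORD_PORT:-7779}"
CKPT_DIR="${CKPT_DIR:-.}"

print_restart_cmds() {
    dir="$1"
    for img in "$dir"/ckpt_*.dmtcp; do
        # Skip the unexpanded pattern when the directory holds no images.
        if [ -e "$img" ]; then
            printf 'dmtcp_restart -h %s -p %s -j %s\n' \
                "$COORD_HOST" "$COORD_PORT" "$img"
        fi
    done
}

print_restart_cmds "$CKPT_DIR"
```

Each image was written by a process on a specific node, so the printed commands would presumably have to be run on the node that owns the image, each in its own terminal or in the background.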

Am I missing something perhaps?

Thanks again for all your help!
Regards,
Marina

On 11/20/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> In this case, instead of running the restart script, you can restart from
> the checkpoint images directly, i.e.,
>
> dmtcp_restart -h $coordinator_addr -p $coordinator_port -j ckpt_XX.dmtcp
>
> You will need to run this command n times, where n is the number of
> checkpoint images (number of processes).
>
>
> Best,
> Jiajun
>
> On Thu, Nov 19, 2015 at 10:35 AM, Marina Moran
> <esperandoelmila...@gmail.com> wrote:
>
>> Hi Jiajun,
>>
>> You are right, that seems to be the reason the others don't connect.
>>
>> Each node writes its checkpoint to a directory that is mounted with NFS
>> on the other two machines. I can see the files on both nodes. I set
>> DMTCP_CHECKPOINT_DIR to that folder. I also tried to write them
>> locally, but the problem is the same.
>>
>> Here is the output of 'ls -l' in the checkpoint folder on both nodes:
>> node m110a (the one that connects, where the coordinator runs):
>>
>> hpcpro@m110a:~/nfs$ ls -l
>> total 26536
>> -rw------- 1 hpcpro hpcpro 5122030 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5122218 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5119200 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5119579 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 2736819 Nov 19 12:31 ckpt_dmtcp_ssh_1310c955e7a-67000-564e0754.dmtcp
>> -rw------- 1 hpcpro hpcpro 3899774 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754_files
>> -rwxr--r-- 1 hpcpro hpcpro   12091 Nov 19 12:31 dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>> lrwxrwxrwx 1 hpcpro hpcpro      67 Nov 19 12:31 dmtcp_restart_script.sh -> /home/hpcpro/nfs/dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>> -rw-r--r-- 1 hpcpro hpcpro     254 Nov 19 12:31 jtimings.csv
>>
>> node m111a (the one that doesn't connect):
>> hpcpro@m111a:~/nfs$ ls -l
>> total 26348
>> -rw------- 1 hpcpro hpcpro 5117116 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5116818 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5116096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 5115075 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755_files
>> -rw------- 1 hpcpro hpcpro 2736548 Nov 19 12:31 ckpt_dmtcp_sshd_1310c955fc5-69000-564e0755.dmtcp
>> -rw------- 1 hpcpro hpcpro 3742556 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755.dmtcp
>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755_files
>>
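
Each listing above shows six .dmtcp files, while the coordinator log quoted further down reports numPeers = 12. A quick way to compare the two on each node might look like the sketch below. The function name count_images and the EXPECTED variable are illustrative assumptions, not DMTCP features; the only thing taken from the thread is the ckpt_*.dmtcp naming and the figure of 12.

```shell
#!/bin/sh
# Count the ckpt_*.dmtcp images visible in a directory and compare the total
# with the number of peers reported by the coordinator (numPeers = 12 in the log).
# count_images and EXPECTED are illustrative names (assumptions).
EXPECTED="${EXPECTED:-12}"

count_images() {
    n=0
    for img in "$1"/ckpt_*.dmtcp; do
        # The pattern stays unexpanded when nothing matches, so test for existence.
        if [ -e "$img" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

found=$(count_images "${1:-.}")
if [ "$found" -eq "$EXPECTED" ]; then
    echo "OK: $found of $EXPECTED images visible"
else
    echo "MISMATCH: $found of $EXPECTED images visible"
fi
```

Run in the checkpoint folder on each node, this only reports how many images that node can see; it does not say why any are missing.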
>>
>> I can't figure out what it can be... Thanks for your help,
>> Regards,
>> Marina
>>
>> On 11/18/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>> > Hi Marina,
>> >
>> > Where are the checkpoint images stored? Are they stored in a shared file
>> > system, or in local storage? From what I can tell from the log,
>> > there are 12 processes before checkpoint, and hence 12 checkpoint
>> > images. On restart, only 6 of them connect to the coordinator. It may
>> > be that the restart script couldn't find the rest of the images.
>> > Could you verify that?
>> >
>> > Best,
>> > Jiajun
>> >
>> > On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran
>> > <esperandoelmila...@gmail.com>
>> > wrote:
>> >
>> >> Hi All! I'm back using DMTCP!
>> >>
>> >> I'm having a problem when restarting a checkpoint.
>> >>
>> >> I have two nodes (PCs) on an Ethernet LAN, with:
>> >> -Debian 8 Jessie,
>> >> -DMTCP 2.4.2 (configured with --enable-timing)
>> >> -OpenMPI 1.10.1.
>> >>
>> >> I do:
>> >> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
>> >>
>> >> On the console where the coordinator is running, I press 'c' to
>> >> checkpoint. After that, once the application has been killed or has
>> >> finished, I run the restart script from the same directory where the
>> >> checkpoints are stored, with the following output:
>> >>
>> >> $ ./dmtcp_restart_script.sh
>> >> [75000] WARNING at socketconnection.cpp:540 in postRestart;
>> >> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
>> >> &_bindAddr,_bindAddrlen) == 0) failed'
>> >>      (strerror((*__errno_location ()))) = Address already in use
>> >>      id() = 1310c955e7a-75000-564d1a58(99506)
>> >> Message: Bind failed.
>> >> [77000] WARNING at socketconnection.cpp:540 in postRestart;
>> >> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
>> >> &_bindAddr,_bindAddrlen) == 0) failed'
>> >>      (strerror((*__errno_location ()))) = Address already in use
>> >>      id() = 1310c955e7a-77000-564d1a58(99517)
>> >> Message: Bind failed.
>> >>
>> >> On the coordinator console it outputs this:
>> >>
>> >> [762] NOTE at dmtcp_coordinator.cpp:1137 in
>> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>> >> connection.  Set numPeers. Generate timestamp'
>> >>      numPeers = 12
>> >>      curTimeStamp = 23166315138
>> >>      compId = 1310c955e7a-66000-564d1a57
>> >> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted)
>> >> failed'
>> >>      _name = restart
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-66000-564d1a57
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-67000-564d1a57
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-71000-564d1a58
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-73000-564d1a58
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-77000-564d1a58
>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c955e7a-75000-564d1a58
>> >>
>> >>
>> >> And when I pressed 'l' to list the connected nodes:
>> >> l
>> >> Client List:
>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>> >> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
>> >> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
>> >> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
>> >> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
>> >> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
>> >> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
>> >>
>> >>
>> >> It seems it hangs... It never ends.
>> >>
>> >> Hope this is something I forgot...
>> >>
>> >> Thanks all in advance,
>> >> Regards
>> >> Marina
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> _______________________________________________
>> >> Dmtcp-forum mailing list
>> >> Dmtcp-forum@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>> >>
>> >
>>
>

