Hi again...

Sorry to ask about this again, but I really need to get the
restart working, and I don't know where to look. I think I am
doing the right things, but it doesn't work.

Any advice would be much appreciated!

best regards,
Marina

On 11/26/15, Marina Moran <esperandoelmila...@gmail.com> wrote:
> Hi Jiajun:
>
> I tried what you told me, here is the output:
>
> $ dmtcp_restart -h 10.0.2.20 -p 7779 -j ckpt_app_heat_512_1310c955e7a-53000-56578f4d.dmtcp
> [1131] ERROR at dmtcp_restart.cpp:760 in main;
> REASON='JASSERT(independentProcessTreeRoots.size() > 0) failed'
> Message: There must be at least one process tree that doesn't have
>   a different process as session leader.
> dmtcp_restart (1131): Terminating...
>
>
> I also tried without the -h and -p options (because I set the
> variable DMTCP_HOST), and I tried all the .dmtcp files (on both
> nodes); they all give the same error.
>
> Am I missing something perhaps?
>
> Thanks again for all your help!
> Regards,
> Marina
>
> On 11/20/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>> In this case, instead of running the restart script, you can restart from
>> the checkpoint images directly, i.e.,
>>
>> dmtcp_restart -h $coordinator_addr -p $coordinator_port -j ckpt_XX.dmtcp
>>
>> You will need to run this command n times, where n is the number of
>> checkpoint images (number of processes).
>>
>>
>> Best,
>> Jiajun
>>
>> On Thu, Nov 19, 2015 at 10:35 AM, Marina Moran
>> <esperandoelmila...@gmail.com> wrote:
>>
>>> Hi Jiajun,
>>>
>>> You are right, that seems to be the reason: the other node doesn't connect.
>>>
>>> Each node writes its checkpoints to a directory that is mounted over NFS
>>> on the other two machines. I can see the files on both nodes. I set
>>> DMTCP_CHECKPOINT_DIR to that folder. I also tried writing the checkpoints
>>> locally, but the problem is the same.
>>>
>>> Here is the output of the 'ls -l' command in the checkpoint folder on
>>> both nodes:
>>> node m110a (the one that connects, where the coordinator runs):
>>>
>>> hpcpro@m110a:~/nfs$ ls -l
>>> total 26536
>>> -rw------- 1 hpcpro hpcpro 5122030 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5122218 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5119200 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5119579 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 2736819 Nov 19 12:31 ckpt_dmtcp_ssh_1310c955e7a-67000-564e0754.dmtcp
>>> -rw------- 1 hpcpro hpcpro 3899774 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754_files
>>> -rwxr--r-- 1 hpcpro hpcpro   12091 Nov 19 12:31 dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>>> lrwxrwxrwx 1 hpcpro hpcpro      67 Nov 19 12:31 dmtcp_restart_script.sh -> /home/hpcpro/nfs/dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
>>> -rw-r--r-- 1 hpcpro hpcpro     254 Nov 19 12:31 jtimings.csv
>>>
>>> node m111a (the one that doesn't connect):
>>> hpcpro@m111a:~/nfs$ ls -l
>>> total 26348
>>> -rw------- 1 hpcpro hpcpro 5117116 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5116818 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5116096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 5115075 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755_files
>>> -rw------- 1 hpcpro hpcpro 2736548 Nov 19 12:31 ckpt_dmtcp_sshd_1310c955fc5-69000-564e0755.dmtcp
>>> -rw------- 1 hpcpro hpcpro 3742556 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755.dmtcp
>>> drwxr-xr-x 2 hpcpro hpcpro    4096 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755_files
>>>
>>>
>>> I can't figure out what it could be... Thanks for your help,
>>> Regards,
>>> Marina
>>>
>>> On 11/18/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>>> > Hi Marina,
>>> >
>>> > Where are the checkpoint images stored? Are they stored on a shared file
>>> > system, or on local storage? From what I can tell from the log, there are
>>> > 12 processes before checkpoint, and hence 12 checkpoint images. On restart,
>>> > only 6 of them connect to the coordinator. It may be that the restart
>>> > script couldn't find the remaining images. Could you verify that?
>>> >
>>> > Best,
>>> > Jiajun
>>> >
>>> > On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran
>>> > <esperandoelmila...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi All! I'm back using DMTCP!
>>> >>
>>> >> I'm having a problem when restarting a checkpoint.
>>> >>
>>> >> I have two nodes (PCs) on an Ethernet LAN, with:
>>> >> -Debian 8 Jessie,
>>> >> -DMTCP 2.4.2 (configured with -enable-timing),
>>> >> -OpenMPI 1.10.1.
>>> >>
>>> >> I do:
>>> >> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
>>> >>
>>> >> On the console where the coordinator is running, I press 'c' to
>>> >> checkpoint. After that, I kill the application (or let it finish), and
>>> >> then, from the same directory where the checkpoints are stored, I run
>>> >> the restart script, with the following output:
>>> >>
>>> >> $ ./dmtcp_restart_script.sh
>>> >> [75000] WARNING at socketconnection.cpp:540 in postRestart;
>>> >> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
>>> >> &_bindAddr,_bindAddrlen) == 0) failed'
>>> >>      (strerror((*__errno_location ()))) = Address already in use
>>> >>      id() = 1310c955e7a-75000-564d1a58(99506)
>>> >> Message: Bind failed.
>>> >> [77000] WARNING at socketconnection.cpp:540 in postRestart;
>>> >> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
>>> >> &_bindAddr,_bindAddrlen) == 0) failed'
>>> >>      (strerror((*__errno_location ()))) = Address already in use
>>> >>      id() = 1310c955e7a-77000-564d1a58(99517)
>>> >> Message: Bind failed.
>>> >>
>>> >> On the coordinator console it outputs this:
>>> >>
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1137 in
>>> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>>> >> connection.  Set numPeers. Generate timestamp'
>>> >>      numPeers = 12
>>> >>      curTimeStamp = 23166315138
>>> >>      compId = 1310c955e7a-66000-564d1a57
>>> >> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted)
>>> >> failed'
>>> >>      _name = restart
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-66000-564d1a57
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-67000-564d1a57
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-71000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-73000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-77000-564d1a58
>>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> >> connected'
>>> >>      hello_remote.from = 1310c955e7a-75000-564d1a58
>>> >>
>>> >>
>>> >> And when I pressed 'l' to list the connected clients:
>>> >> l
>>> >> Client List:
>>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>>> >> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
>>> >> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
>>> >> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
>>> >> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
>>> >> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
>>> >> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
>>> >>
>>> >>
>>> >> It seems to hang... It never finishes.
>>> >>
>>> >> I hope it is just something I forgot...
>>> >>
>>> >> Thanks all in advance,
>>> >> Regards
>>> >> Marina
>>> >>
>>> >>
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >> _______________________________________________
>>> >> Dmtcp-forum mailing list
>>> >> Dmtcp-forum@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>> >>
>>> >
>>>
>>
>
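
A note on Jiajun's question about whether all images can be found: the coordinator log reports numPeers = 12, while each of the two 'ls -l' listings shows six .dmtcp images. A minimal sketch of a check that could be run on each node is below. The function names count_ckpt_images and check_ckpt_images are hypothetical helpers, not part of DMTCP, and the sketch only assumes the ckpt_*.dmtcp naming seen in the listings.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DMTCP): count the checkpoint images
# visible in a directory and compare the count with the numPeers value
# printed by the coordinator on restart.

count_ckpt_images() {
    dir=$1
    n=0
    for f in "$dir"/ckpt_*.dmtcp; do
        # When nothing matches, the pattern stays unexpanded; -f filters it out.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

check_ckpt_images() {
    dir=$1
    expected=$2
    found=$(count_ckpt_images "$dir")
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected images visible in $dir"
    else
        echo "MISMATCH: $found of $expected images visible in $dir"
        return 1
    fi
}
```

For example, after sourcing the file, running check_ckpt_images "$HOME/nfs" 12 on m110a and on m111a would show whether each node can see all 12 images or only its own six.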

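Jiajun's suggestion to run dmtcp_restart once per checkpoint image can be sketched as a dry run that only prints the commands, so they can be reviewed before being executed on the right node. The variable names COORD_HOST and COORD_PORT and the function print_restart_commands are placeholders chosen for this sketch; the address, port, and flags are the ones quoted in the thread. The thread does not confirm that this resolves the hang, and Marina reports a JASSERT failure when restarting a single image this way.

```shell
#!/bin/sh
# Dry-run sketch: print one dmtcp_restart command per checkpoint image.
# COORD_HOST and COORD_PORT are placeholder names; the defaults are the
# coordinator address and port used earlier in this thread.
COORD_HOST=${COORD_HOST:-10.0.2.20}
COORD_PORT=${COORD_PORT:-7779}

print_restart_commands() {
    dir=$1
    for img in "$dir"/ckpt_*.dmtcp; do
        # Skip the unexpanded pattern when the directory has no images.
        if [ -f "$img" ]; then
            echo "dmtcp_restart -h $COORD_HOST -p $COORD_PORT -j $img"
        fi
    done
}
```

Calling print_restart_commands "$HOME/nfs" on each node lists one command per image found there, which also makes it easy to see whether the number of commands matches the number of processes the coordinator expects.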