Hi Marina,
I remember that you gave me access to your cluster in the past. I tried to
log in just now, but it didn't work. Could you check whether the login method
you gave me still works? Logging in to the cluster would be the best way to
diagnose the problem.
Best,
Jiajun
On Tue, Dec 1, 2015 at 2:16 PM, Marina Moran <esperandoelmila...@gmail.com>
wrote:
> Hi again...
>
> Sorry to ask about this again, but I really need to get the
> restart working, and I don't know where to look. I think I am
> doing the right things, but it doesn't work.
>
> Any advice would be much appreciated!
>
> best regards,
> Marina
>
> On 11/26/15, Marina Moran <esperandoelmila...@gmail.com> wrote:
> > Hi Jiajun:
> >
> > I tried what you told me, here is the output:
> >
> > $ dmtcp_restart -h 10.0.2.20 -p 7779 -j ckpt_app_heat_512_1310c955e7a-53000-56578f4d.dmtcp
> > [1131] ERROR at dmtcp_restart.cpp:760 in main; REASON='JASSERT(independentProcessTreeRoots.size() > 0) failed'
> > Message: There must be at least one process tree that doesn't have a different process as session leader.
> > dmtcp_restart (1131): Terminating...
> >
> >
> > I also tried without the -h and -p options (because I set the
> > DMTCP_HOST variable) and with all the .dmtcp files (on both
> > nodes), and they all give the same error.
> >
> > Am I missing something perhaps?
> >
> > Thanks again for all your help!
> > Regards,
> > Marina
> >
> > On 11/20/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> >> In this case, instead of running the restart script, you can restart
> >> from the checkpoint images directly, i.e.,
> >>
> >> dmtcp_restart -h $coordinator_addr -p $coordinator_port -j ckpt_XX.dmtcp
> >>
> >> You will need to run this command n times, where n is the number of
> >> checkpoint images (number of processes).
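The "run it n times" advice above could be scripted roughly as follows. This is only a sketch: COORD_HOST, COORD_PORT, CKPT_DIR and restart_commands are placeholder names made up for this example (the defaults are the address and port used elsewhere in this thread), and the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print one dmtcp_restart command per checkpoint image in a directory.
# COORD_HOST, COORD_PORT and CKPT_DIR are hypothetical names for this example.
COORD_HOST="${COORD_HOST:-10.0.2.20}"
COORD_PORT="${COORD_PORT:-7779}"
CKPT_DIR="${CKPT_DIR:-.}"

restart_commands() {
    # $1 is the directory that holds the ckpt_*.dmtcp images.
    for img in "$1"/ckpt_*.dmtcp; do
        [ -e "$img" ] || continue   # the glob matched nothing
        printf 'dmtcp_restart -h %s -p %s -j %s\n' "$COORD_HOST" "$COORD_PORT" "$img"
    done
}

restart_commands "$CKPT_DIR"
```

Since a restarted process keeps running under dmtcp_restart, the printed commands would presumably have to be started in the background or in separate terminals, rather than one after another.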
> >>
> >>
> >> Best,
> >> Jiajun
> >>
> >> On Thu, Nov 19, 2015 at 10:35 AM, Marina Moran
> >> <esperandoelmila...@gmail.com> wrote:
> >>
> >>> Hi Jiajun,
> >>>
> >>> You are right, that seems to be the reason: the rest don't connect.
> >>>
> >>> Each node writes the checkpoint to a directory that is mounted with NFS
> >>> on the other two machines. I can see the files on both nodes. I set
> >>> DMTCP_CHECKPOINT_DIR to that folder. I also tried to write them
> >>> locally, but the problem is the same.
> >>>
> >>> Here is the output of the 'ls' command in the checkpoint folder on
> >>> both nodes:
> >>> node m110a (the one that connects, where the coordinator runs):
> >>>
> >>> hpcpro@m110a:~/nfs$ ls -l
> >>> total 26536
> >>> -rw------- 1 hpcpro hpcpro 5122030 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-71000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5122218 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-74000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5119200 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-76000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5119579 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955e7a-78000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 2736819 Nov 19 12:31 ckpt_dmtcp_ssh_1310c955e7a-67000-564e0754.dmtcp
> >>> -rw------- 1 hpcpro hpcpro 3899774 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_orterun_1310c955e7a-66000-564e0754_files
> >>> -rwxr--r-- 1 hpcpro hpcpro 12091 Nov 19 12:31 dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
> >>> lrwxrwxrwx 1 hpcpro hpcpro 67 Nov 19 12:31 dmtcp_restart_script.sh -> /home/hpcpro/nfs/dmtcp_restart_script_1310c955e7a-66000-564e0754.sh
> >>> -rw-r--r-- 1 hpcpro hpcpro 254 Nov 19 12:31 jtimings.csv
> >>>
> >>> node m111a (the one that doesn't connect):
> >>> hpcpro@m111a:~/nfs$ ls -l
> >>> total 26348
> >>> -rw------- 1 hpcpro hpcpro 5117116 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-72000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5116818 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-73000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5116096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-75000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 5115075 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_app_heat_512_1310c955fc5-77000-564e0755_files
> >>> -rw------- 1 hpcpro hpcpro 2736548 Nov 19 12:31 ckpt_dmtcp_sshd_1310c955fc5-69000-564e0755.dmtcp
> >>> -rw------- 1 hpcpro hpcpro 3742556 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755.dmtcp
> >>> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 ckpt_orted_1310c955fc5-70000-564e0755_files
> >>>
> >>>
> >>> I can't figure out what it could be... Thanks for your help,
> >>> regards
> >>> Marina
> >>>
> >>> On 11/18/15, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> >>> > Hi Marina,
> >>> >
> >>> > Where are the checkpoint images stored? Are they on a shared file
> >>> > system or on local storage? From what I can tell from the log, there
> >>> > were 12 processes before the checkpoint, and hence 12 checkpoint
> >>> > images. On restart, only 6 of them connect to the coordinator. It may
> >>> > be that the restart script couldn't find the remaining images. Could
> >>> > you verify that?
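One way to check this is to count the images per node. The sketch below assumes that the first field of the identifier in each file name (for example 1310c955e7a in ckpt_orterun_1310c955e7a-66000-564e0754.dmtcp) identifies the node, which is what the differing prefixes in this thread suggest; count_images_by_host and CKPT_DIR are names made up for this example.

```shell
#!/bin/sh
# Sketch: count checkpoint images per node prefix in one directory.
# Assumes names of the form ckpt_<prog>_<hostid>-<pid>-<time>.dmtcp.
CKPT_DIR="${CKPT_DIR:-.}"

count_images_by_host() {
    # Keep only *.dmtcp images (the *_files directories and the restart
    # script do not match), extract <hostid>, and print "<hostid> <count>".
    ls "$1" \
        | sed -n 's/^ckpt_.*_\([0-9a-f]*\)-[0-9]*-[0-9a-f]*\.dmtcp$/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

count_images_by_host "$CKPT_DIR"
```

The per-node counts can then be added up and compared with the numPeers value that the coordinator reports on restart (12 in the coordinator log in this thread).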
> >>> >
> >>> > Best,
> >>> > Jiajun
> >>> >
> >>> > On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran
> >>> > <esperandoelmila...@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi All! I'm back using DMTCP!
> >>> >>
> >>> >> I'm having a problem when restarting a checkpoint.
> >>> >>
> >>> >> I have two nodes (PCs) in an ethernet lan, with:
> >>> >> -Debian 8 Jessie,
> >>> >> -DMTCP 2.4.2 (configured with -enable-timing)
> >>> >> -OpenMPI 1.10.1.
> >>> >>
> >>> >> I do:
> >>> >> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
> >>> >>
> >>> >> On the console where the coordinator is running, I press 'c' to
> >>> >> checkpoint. After that, I kill the application or let it finish, and
> >>> >> then, from the directory where the checkpoints are stored, I run
> >>> >> the restart script, with the following output:
> >>> >>
> >>> >> $ ./dmtcp_restart_script.sh
> >>> >> [75000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
> >>> >> (strerror((*__errno_location ()))) = Address already in use
> >>> >> id() = 1310c955e7a-75000-564d1a58(99506)
> >>> >> Message: Bind failed.
> >>> >> [77000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
> >>> >> (strerror((*__errno_location ()))) = Address already in use
> >>> >> id() = 1310c955e7a-77000-564d1a58(99517)
> >>> >> Message: Bind failed.
> >>> >>
> >>> >> On the coordinator console it outputs this:
> >>> >>
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1137 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
> >>> >> numPeers = 12
> >>> >> curTimeStamp = 23166315138
> >>> >> compId = 1310c955e7a-66000-564d1a57
> >>> >> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
> >>> >> _name = restart
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-66000-564d1a57
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-67000-564d1a57
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-71000-564d1a58
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-73000-564d1a58
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-77000-564d1a58
> >>> >> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >>> >> hello_remote.from = 1310c955e7a-75000-564d1a58
> >>> >>
> >>> >>
> >>> >> And when I pressed L to show connected nodes:
> >>> >> l
> >>> >> Client List:
> >>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >>> >> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
> >>> >> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
> >>> >> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
> >>> >> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
> >>> >> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
> >>> >> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
> >>> >>
> >>> >>
> >>> >> It seems to hang... It never ends.
> >>> >>
> >>> >> Hope this is something I forgot...
> >>> >>
> >>> >> Thanks all in advance,
> >>> >> Regards
> >>> >> Marina
> >>> >>
> >>> >>
> >>> >>
> >>> >> ------------------------------------------------------------------------------
> >>> >> _______________________________________________
> >>> >> Dmtcp-forum mailing list
> >>> >> Dmtcp-forum@lists.sourceforge.net
> >>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>> >>
> >>> >
> >>>
> >>
> >
>