Hi All! I'm back using DMTCP!

I'm having a problem when restarting a checkpoint.

I have two nodes (PCs) on an Ethernet LAN, with:
-Debian 8 Jessie,
-DMTCP 2.4.2 (configured with --enable-timing)
-OpenMPI 1.10.1.

I do:
$ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512

On the console where the coordinator is running, I press 'c' to
checkpoint. After that, I either kill the application or let it
finish. Then, from the same directory where the checkpoints are
stored, I run the restart script, which gives the following output:

$ ./dmtcp_restart_script.sh
[75000] WARNING at socketconnection.cpp:540 in postRestart;
REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
&_bindAddr,_bindAddrlen) == 0) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 1310c955e7a-75000-564d1a58(99506)
Message: Bind failed.
[77000] WARNING at socketconnection.cpp:540 in postRestart;
REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
&_bindAddr,_bindAddrlen) == 0) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 1310c955e7a-77000-564d1a58(99517)
Message: Bind failed.

On the coordinator console it outputs this:

[762] NOTE at dmtcp_coordinator.cpp:1137 in
validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
connection.  Set numPeers. Generate timestamp'
     numPeers = 12
     curTimeStamp = 23166315138
     compId = 1310c955e7a-66000-564d1a57
[762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
     _name = restart
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-66000-564d1a57
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-67000-564d1a57
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-71000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-73000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-77000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1310c955e7a-75000-564d1a58


And when I press 'l' to list the connected clients:
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED


It seems to hang... The restart never finishes.

I hope this is just something I forgot to do...

Thanks all in advance,
Regards
Marina
