Hi All! I'm back using DMTCP! I'm having a problem when restarting a checkpoint.

I have two nodes (PCs) on an Ethernet LAN, with:

- Debian 8 Jessie
- DMTCP 2.4.2 (configured with -enable-timing)
- OpenMPI 1.10.1

I run:

    $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512

On the console where the coordinator is running, I press 'c' to checkpoint. After that, I kill the application (or let it finish), and then, from the same directory where the checkpoints are stored, I run the restart script, with the following output:

    $ ./dmtcp_restart_script.sh
    [75000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 1310c955e7a-75000-564d1a58(99506)
    Message: Bind failed.
    [77000] WARNING at socketconnection.cpp:540 in postRestart; REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 1310c955e7a-77000-564d1a58(99517)
    Message: Bind failed.

The coordinator console prints this:

    [762] NOTE at dmtcp_coordinator.cpp:1137 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
         numPeers = 12
         curTimeStamp = 23166315138
         compId = 1310c955e7a-66000-564d1a57
    [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
         _name = restart
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-66000-564d1a57
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-67000-564d1a57
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-71000-564d1a58
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-73000-564d1a58
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-77000-564d1a58
    [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
         hello_remote.from = 1310c955e7a-75000-564d1a58

And when I press 'l' to list the connected clients:

    l
    Client List:
    #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
    64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
    65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
    66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
    67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
    68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
    69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED

It seems to hang: the restart never completes. I hope this is just something I forgot.

Thanks all in advance,
Regards,
Marina