Hi Marina,

Where are the checkpoint images stored: on a shared file system or on local storage? From what I can tell from the log, there were 12 processes before the checkpoint, and hence 12 checkpoint images. On restart, only 6 of them connected to the coordinator. It may be that the restart script couldn't find the remaining images. Could you verify that?
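If it helps, here is a rough Python sketch of the check I have in mind. It compares the numPeers value the coordinator printed with the number of distinct workers that actually connected, and counts the checkpoint images in a directory so you can run it on each node. The function names are just illustrative, and the `ckpt_*.dmtcp` file pattern is an assumption about how the images are named on your system, so please adjust it to whatever you see in your checkpoint directory.

```python
import glob
import os
import re

# Coordinator output copied from the report below (only the relevant lines).
SAMPLE_LOG = """\
numPeers = 12
hello_remote.from = 1310c955e7a-66000-564d1a57
hello_remote.from = 1310c955e7a-67000-564d1a57
hello_remote.from = 1310c955e7a-71000-564d1a58
hello_remote.from = 1310c955e7a-73000-564d1a58
hello_remote.from = 1310c955e7a-77000-564d1a58
hello_remote.from = 1310c955e7a-75000-564d1a58
"""


def parse_coordinator_log(text):
    """Return (expected_peers, sorted list of distinct connected worker ids)."""
    match = re.search(r"numPeers\s*=\s*(\d+)", text)
    expected = int(match.group(1)) if match else None
    connected = sorted(set(re.findall(r"hello_remote\.from\s*=\s*(\S+)", text)))
    return expected, connected


def count_images(directory, pattern="ckpt_*.dmtcp"):
    """Count checkpoint images in a directory.

    ASSUMPTION: the default pattern is a guess at the image naming;
    change it if your files are named differently.
    """
    return len(glob.glob(os.path.join(directory, pattern)))


if __name__ == "__main__":
    expected, connected = parse_coordinator_log(SAMPLE_LOG)
    print(f"expected peers: {expected}, connected workers: {len(connected)}")
    print(f"checkpoint images in current directory: {count_images('.')}")
```

If the image count on the node where you run the restart script is lower than numPeers, the remaining images are probably only on the other node's local disk.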
Best,
Jiajun

On Wed, Nov 18, 2015 at 6:03 PM, Marina Moran <esperandoelmila...@gmail.com> wrote:

> Hi All! I'm back using DMTCP!
>
> I'm having a problem when restarting a checkpoint.
>
> I have two nodes (PCs) on an Ethernet LAN, with:
> - Debian 8 Jessie,
> - DMTCP 2.4.2 (configured with --enable-timing)
> - OpenMPI 1.10.1.
>
> I do:
> $ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
>
> On the console where the coordinator is running, I press 'c' to
> checkpoint. After that, I killed the application or it finished, and
> then, from the same directory where the checkpoints are stored, I run
> the restart script, with the following output:
>
> $ ./dmtcp_restart_script.sh
> [75000] WARNING at socketconnection.cpp:540 in postRestart;
> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
> &_bindAddr,_bindAddrlen) == 0) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 1310c955e7a-75000-564d1a58(99506)
> Message: Bind failed.
> [77000] WARNING at socketconnection.cpp:540 in postRestart;
> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
> &_bindAddr,_bindAddrlen) == 0) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 1310c955e7a-77000-564d1a58(99517)
> Message: Bind failed.
>
> On the coordinator console it outputs this:
>
> [762] NOTE at dmtcp_coordinator.cpp:1137 in
> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> connection. Set numPeers. Generate timestamp'
> numPeers = 12
> curTimeStamp = 23166315138
> compId = 1310c955e7a-66000-564d1a57
> [762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted)
> failed'
> _name = restart
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-66000-564d1a57
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-67000-564d1a57
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-71000-564d1a58
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-73000-564d1a58
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-77000-564d1a58
> [762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c955e7a-75000-564d1a58
>
>
> And when I pressed L to show connected nodes:
> l
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
> 65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
> 66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58,
> CHECKPOINTED
> 67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58,
> CHECKPOINTED
> 68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58,
> CHECKPOINTED
> 69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58,
> CHECKPOINTED
>
>
> It seems it hangs... It never ends.
>
> Hope this is something I forgot...
>
> Thanks all in advance,
> Regards
> Marina
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>