Great, now I know what to fix. I will write back once I have uploaded the patch.
Kapil

On Mon, Oct 27, 2014 at 6:15 PM, Marina Moran <esperandoelmila...@gmail.com> wrote:
> I created the folder named "1" inside
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859, which already exists, and it
> works!
>
> On 10/27/14, Kapil Arya <kapil.arya...@gmail.com> wrote:
> > Hi Marina,
> >
> > Can you restart from the same checkpoint image as in the previous email
> > after creating the following directory
> > "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"
> > ?
> >
> > Apparently, DMTCP is unable to create the directory path and that's why it
> > can't create the file in there. Once we confirm that this is indeed the
> > problem, I will try to come up with a fix by tomorrow.
> >
> > Kapil
> >
> > On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran
> > <esperandoelmila...@gmail.com> wrote:
> >
> >> Hi Kapil!
> >>
> >> It is the same as before.
> >>
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart ckpt_*.dmtcp
> >> [7894] mtcp_restart.c:1310 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7892] mtcp_restart.c:1310 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7895] mtcp_restart.c:1310 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7893] mtcp_restart.c:1310 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >>
> >> and at the coordinator:
> >>
> >> [7832] NOTE at dmtcp_coordinator.cpp:1096 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
> >> numPeers = 5
> >> curTimeStamp = 22631325571
> >> compId = 1310c956110-40000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c956110-40000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c956110-41000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c956110-42000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c956110-43000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> hello_remote.from = 1310c956110-44000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> client->identity() = 1310c956110-43000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> client->identity() = 1310c956110-41000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> client->identity() = 1310c956110-44000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> client->identity() = 1310c956110-42000-544ee9ab
> >> l
> >> Client List:
> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
> >>
> >> On 10/27/14, Kapil Arya <kapil.arya...@gmail.com> wrote:
> >> > Hi Marina,
> >> >
> >> > Could you do the following and then reproduce the error and send us the
> >> > output:
> >> >
> >> > git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
> >> > cd dmtcp-trunk
> >> > ./configure
> >> > make
> >> >
> >> > Now use this code to run your tests.
> >> >
> >> > This will pull the latest trunk to allow us to diagnose the error.
> >> >
> >> > Kapil
> >> >
> >> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
> >> > <esperandoelmila...@gmail.com> wrote:
> >> >
> >> >> Hi everyone:
> >> >>
> >> >> I have a node (Intel i5) with 4 cores with:
> >> >> Debian jessie amd64
> >> >> OpenMPI 1.6.5
> >> >> DMTCP: 2.3.1
> >> >> NAS benchmarks
> >> >>
> >> >> My first try is using one node (four processes):
> >> >>
> >> >> I started the coordinator in one terminal:
> >> >>
> >> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_coordinator
> >> >>
> >> >> In another terminal I launch the program:
> >> >>
> >> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_launch mpirun -np 4 lu.A.4
> >> >>
> >> >> In another terminal I request the checkpoint:
> >> >>
> >> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
> >> >>
> >> >> Then I call the restart script, where it hangs:
> >> >>
> >> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
> >> >> [1057] mtcp_restart.c:1303 open_shared_file:
> >> >> unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1058] mtcp_restart.c:1303 open_shared_file:
> >> >> unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1060] mtcp_restart.c:1303 open_shared_file:
> >> >> unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1059] mtcp_restart.c:1303 open_shared_file:
> >> >> unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >>
> >> >> While the coordinator window shows this:
> >> >>
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
> >> >> numPeers = 5
> >> >> curTimeStamp = 22631250933
> >> >> compId = 1310c956110-60000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> >> hello_remote.from = 1310c956110-60000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> >> hello_remote.from = 1310c956110-61000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> >> hello_remote.from = 1310c956110-62000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> >> hello_remote.from = 1310c956110-63000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> >> >> hello_remote.from = 1310c956110-64000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> >> client->identity() = 1310c956110-63000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> >> client->identity() = 1310c956110-62000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> >> client->identity() = 1310c956110-64000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> >> >> client->identity() = 1310c956110-61000-544ed77d
> >> >> l
> >> >> Client List:
> >> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d, CHECKPOINTED
> >> >>
> >> >> I searched this forum and the internet for this error but had no
> >> >> luck. Any help would be much appreciated!
> >> >>
> >> >> Regards,
> >> >> Marina
> >> >>
> >> >> ------------------------------------------------------------------------------
> >> >> _______________________________________________
> >> >> Dmtcp-forum mailing list
> >> >> Dmtcp-forum@lists.sourceforge.net
> >> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
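
Until a patched DMTCP is available, the workaround confirmed in this thread (creating the missing Open MPI session directory by hand before restarting) could be sketched as a small shell helper. This is only a sketch under assumptions: `ensure_session_dir` is a hypothetical name, not a DMTCP or Open MPI command, and the directory path has to be copied from your own error message, since it differs between runs. The trailing "2" in the trunk error output looks like errno 2 (ENOENT, "No such file or directory"), which would be consistent with the missing directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): recreate a session directory
# that is missing at restart time, so the shared-memory file that
# open_shared_file complains about has a parent directory to live in.
ensure_session_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "usage: ensure_session_dir DIRECTORY" >&2
        return 1
    fi
    # -p creates any missing parent directories and does not fail
    # if the directory already exists.
    mkdir -p -- "$dir"
}

# Example use, with the path reported in this thread (copy yours from
# the "unable to create file" message, dropping the file name):
#   ensure_session_dir "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"
#   ~/dmtcp-trunk/bin/dmtcp_restart ckpt_*.dmtcp
```

The helper only creates the directory; it does not touch the checkpoint images, so it should be safe to run again before each restart attempt.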