Great. Now I know what to fix. I will write back once I have uploaded
the patch.
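
Until the patch is available, the workaround confirmed in the quoted messages below can be scripted. This is only a rough sketch: the path is copied from the "unable to create file" output in this thread and will differ on every run, and the variable names are made up for illustration. The trailing "2" in that output looks like errno 2 (ENOENT, "No such file or directory"), which would fit a missing parent directory, but that reading is an assumption.

```shell
# File that mtcp_restart reported it could not create. This value is
# copied from the error output in this thread; substitute your own.
missing_file="/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a"

# The directory that must exist before the restart can create the file.
session_dir=$(dirname "$missing_file")

# mkdir -p creates any missing parent directories and does nothing
# if the directory already exists.
mkdir -p "$session_dir"
```

After that, rerun the restart as before (~/dmtcp-trunk/bin/dmtcp_restart ckpt_*.dmtcp).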

Kapil

On Mon, Oct 27, 2014 at 6:15 PM, Marina Moran <esperandoelmila...@gmail.com>
wrote:

> I created a folder named "1" inside
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859, which already existed, and it
> works!
>
>
>
> On 10/27/14, Kapil Arya <kapil.arya...@gmail.com> wrote:
> > Hi Marina,
> >
> > Can you restart from the same checkpoint image as in the previous email
> > after creating the following directory
> > "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"
> > ?
> >
> > Apparently, DMTCP is unable to create the directory path, and that's why
> > it can't create the file in there.  Once we confirm that this is indeed
> > the problem, I will try to come up with a fix by tomorrow.
> >
> > Kapil
> >
> > On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran
> > <esperandoelmila...@gmail.com>
> > wrote:
> >
> >> Hi Kapil!
> >>
> >> It is the same as before.
> >>
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart
> >> ckpt_*.dmtcp
> >> [7894] mtcp_restart.c:1310 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7892] mtcp_restart.c:1310 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7895] mtcp_restart.c:1310 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >> [7893] mtcp_restart.c:1310 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> >>
> >>
> >> and at the coordinator:
> >> [7832] NOTE at dmtcp_coordinator.cpp:1096 in
> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> >> connection.  Set numPeers. Generate timestamp'
> >>      numPeers = 5
> >>      curTimeStamp = 22631325571
> >>      compId = 1310c956110-40000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-40000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-41000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-42000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-43000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-44000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-43000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-41000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-44000-544ee9ab
> >> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-42000-544ee9ab
> >> l
> >> Client List:
> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
> >>
> >>
> >> On 10/27/14, Kapil Arya <kapil.arya...@gmail.com> wrote:
> >> > Hi Marina,
> >> >
> >> > Could you do the following, then reproduce the error and send us the
> >> > output:
> >> >
> >> >     git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
> >> >     cd dmtcp-trunk
> >> >     ./configure
> >> >     make
> >> >
> >> > Then use this build to run your tests.
> >> >
> >> > Building from the latest trunk will help us diagnose the error.
> >> >
> >> > Kapil
> >> >
> >> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
> >> > <esperandoelmila...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi everyone:
> >> >>
> >> >> I have a 4-core node (Intel i5) with:
> >> >> Debian jessie amd64
> >> >> OpenMPI 1.6.5
> >> >> DMTCP: 2.3.1
> >> >> NAS benchmarks
> >> >>
> >> >> My first try is using one node (four processes):
> >> >>
> >> >> I started the coordinator in one terminal:
> >> >>
> >> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_coordinator
> >> >>
> >> >>
> >> >> In another terminal I launch the program:
> >> >>
> >> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_launch mpirun -np 4 lu.A.4
> >> >>
> >> >>
> >> >> In another terminal I call the checkpoint:
> >> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
> >> >>
> >> >>
> >> >> Then I call the restart script, which is where it hangs:
> >> >>
> >> >>    hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
> >> >>  [1057] mtcp_restart.c:1303 open_shared_file:
> >> >>   unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1058] mtcp_restart.c:1303 open_shared_file:
> >> >>   unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1060] mtcp_restart.c:1303 open_shared_file:
> >> >>   unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >> [1059] mtcp_restart.c:1303 open_shared_file:
> >> >>   unable to create file
> >> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> >>
> >> >>
> >> >> Meanwhile, the coordinator window shows this:
> >> >>
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in
> >> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> >> >> connection.  Set numPeers. Generate timestamp'
> >> >>      numPeers = 5
> >> >>      curTimeStamp = 22631250933
> >> >>      compId = 1310c956110-60000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> >> connected'
> >> >>      hello_remote.from = 1310c956110-60000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> >> connected'
> >> >>      hello_remote.from = 1310c956110-61000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> >> connected'
> >> >>      hello_remote.from = 1310c956110-62000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> >> connected'
> >> >>      hello_remote.from = 1310c956110-63000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> >> connected'
> >> >>      hello_remote.from = 1310c956110-64000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> >> REASON='client disconnected'
> >> >>      client->identity() = 1310c956110-63000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> >> REASON='client disconnected'
> >> >>      client->identity() = 1310c956110-62000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> >> REASON='client disconnected'
> >> >>      client->identity() = 1310c956110-64000-544ed77d
> >> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> >> REASON='client disconnected'
> >> >>      client->identity() = 1310c956110-61000-544ed77d
> >> >> l
> >> >> Client List:
> >> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d, CHECKPOINTED
> >> >>
> >> >>
> >> >> I searched this forum and the internet for this error but had no
> >> >> luck. Any help would be greatly appreciated!
> >> >>
> >> >> Regards,
> >> >> Marina
> >> >>
> >> >>
> >> >>
> >> >> ------------------------------------------------------------------------------
> >> >> _______________________________________________
> >> >> Dmtcp-forum mailing list
> >> >> Dmtcp-forum@lists.sourceforge.net
> >> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >> >>
> >> >
> >>
> >
>