Hi all,

Thanks to Rohan Garg's support, the problem is now solved. The issue was
that I was killing the job with CTRL+C instead of pressing 'k' in the
coordinator, which left the coordinator believing the computation was
still RUNNING, so it rejected the later restart.
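
In short, the sequence that works (a condensed sketch of the full transcript
below; the paths are from my virtualized setup and may differ on yours) is:

    # terminal 1: start the coordinator and leave it running
    ./bin/dmtcp_coordinator

    # terminal 2: launch the MPI job under DMTCP
    ./bin/dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI

    # at the coordinator prompt:
    #   c <enter>   request a checkpoint
    #   l <enter>   list the connected clients
    #   k <enter>   kill the computation (instead of CTRL+C on mpirun)

    # afterwards, from the directory holding the checkpoint images:
    ./bin/dmtcp_restart ckpt_*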

For the sake of completeness, the full set of tests is attached below,
in the hope that it helps someone in the same situation.


Best regards,

Manuel

---
---

 - Started the DMTCP coordinator:

    [slurm@localhost dmtcp]$ ./bin/dmtcp_coordinator
    dmtcp_coordinator starting...
        Host: localhost.localdomain (127.0.0.1)
        Port: 7779
        Checkpoint Interval: disabled (checkpoint manually instead)
        Exit on last client: 0
    Type '?' for help.
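
   (As an aside: the checkpoint interval is reported as disabled because I
   checkpoint manually with 'c'; if I remember the option correctly, an
   automatic interval can be requested when starting the coordinator, e.g.:)

    ./bin/dmtcp_coordinator -i 60    # assumed option: checkpoint every 60 seconds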

 - Ran the helloWorldMPI program under DMTCP from a separate terminal:


    [slurm@localhost dmtcp]$ ./bin/dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI
    Process 2 of 3 is on localhost.localdomain
    Hello world from process 2 of 3
    2: 2
    Process 0 of 3 is on localhost.localdomain
    Hello world from process 0 of 3
    0: 2
    Process 1 of 3 is on localhost.localdomain
    Hello world from process 1 of 3
    ...

 - Checked the processes connected to the coordinator by pressing 'l' at the
   coordinator terminal:

    l <enter>
    Client List:
    #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
    1, mpiexec.hydra[40000:23455]@localhost.localdomain, 5c0352704509e818-40000-555abc19, RUNNING
    2, hydra_pmi_proxy[41000:23459]@localhost.localdomain, 5c0352704509e818-41000-555abc19, RUNNING
    3, helloWorldMPI[42000:23462]@localhost.localdomain, 5c0352704509e818-42000-555abc19, RUNNING
    4, helloWorldMPI[43000:23463]@localhost.localdomain, 5c0352704509e818-43000-555abc19, RUNNING
    5, helloWorldMPI[44000:23464]@localhost.localdomain, 5c0352704509e818-44000-555abc19, RUNNING

  - Entered 'c' at the coordinator terminal to issue a checkpoint, and then
    'l' again to verify that the processes were still running:

    c <enter>
    [23454] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
         s.numPeers = 5
    [23454] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
         compId.generation() = 1
    [23454] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
    [23454] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
    [23454] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
    [23454] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
    [23454] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
    [23454] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
    [23454] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
    l <enter>
    Client List:
    #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
    1, mpiexec.hydra[40000:23455]@localhost.localdomain, 5c0352704509e818-40000-555abc19, RUNNING
    2, hydra_pmi_proxy[41000:23459]@localhost.localdomain, 5c0352704509e818-41000-555abc19, RUNNING
    3, helloWorldMPI[42000:23462]@localhost.localdomain, 5c0352704509e818-42000-555abc19, RUNNING
    4, helloWorldMPI[43000:23463]@localhost.localdomain, 5c0352704509e818-43000-555abc19, RUNNING
    5, helloWorldMPI[44000:23464]@localhost.localdomain, 5c0352704509e818-44000-555abc19, RUNNING

  - Next, I killed the computation by pressing 'k' at the coordinator terminal:

    k
    [23454] NOTE at dmtcp_coordinator.cpp:588 in handleUserCommand; REASON='Killing all connected Peers...'
    [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
         client->identity() = 5c0352704509e818-40000-555abc19
    [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
         client->identity() = 5c0352704509e818-41000-555abc19
    [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
         client->identity() = 5c0352704509e818-42000-555abc19
    [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
         client->identity() = 5c0352704509e818-44000-555abc19
    [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
         client->identity() = 5c0352704509e818-43000-555abc19

  - Finally, I tried to restart from the checkpoint images, and it worked:

    [slurm@localhost dmtcp]$ ./bin/dmtcp_restart ckpt_*
    [23491] mtcp_restart.c:1204 read_shared_memory_area_from_file:
      mapping /dev/shm/mpich_shar_tmpAEfhnO with data from ckpt image
    [23490] mtcp_restart.c:1204 read_shared_memory_area_from_file:
      mapping /dev/shm/mpich_shar_tmpAEfhnO with data from ckpt image
    0: 7
    2: 7
    1: 7
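
  Restarting should also be possible through the restart script that DMTCP
  writes alongside the checkpoint images (the earlier failed attempt in the
  quoted thread used /tmp/dmtcp_restart_script.sh); this is just a sketch,
  assuming the script ended up in the current directory:

    [slurm@localhost dmtcp]$ sh ./dmtcp_restart_script.sh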

2015-05-15 18:10 GMT+02:00 Rohan Garg <rohg...@ccs.neu.edu>:
> Thank you! A VM image sounds great! I'm assuming it'll have the
> pre-configured MPICH.
>
> ----- Original Message -----
> From: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
> To: "Rohan Garg" <rohg...@ccs.neu.edu>
> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
> Sent: Friday, May 15, 2015 11:49:36 AM
> Subject: Re: checkpointing MPI applications
>
> Hi,
>
> My environment is not fixed at all, so if you have any suggestion I have no
> problem on changing it :) my only requirement is Slurm, but the MPI and
> checkpoint libraries can be modified freely.
>
> Or if you prefer, I can keep this one and help you on the debug process.
> Everything is virtualized, so I can send you the image if you want to use
> it for your tests.
>
> Thanks for your help,
>
>
> Manuel
>
> El viernes, 15 de mayo de 2015, Rohan Garg <rohg...@ccs.neu.edu> escribió:
>
>> Hi Manuel,
>>
>> Sorry for the delayed response. I don't see any obvious problems
>> from the logs that you shared, or your methodology. It seems like
>> DMTCP failed to restore one or more processes on restart.
>>
>> I believe we support MPICH, but I'll have to go back and check if
>> we have regressed. I'll try to reproduce this issue locally and
>> report back to you. Is there anything specific about your run-time
>> environment that I should keep in mind?
>>
>> Thanks,
>> Rohan
>>
>> ----- Original Message -----
>> From: "gene" <g...@ccs.neu.edu>
>> To: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
>> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
>> Sent: Thursday, May 14, 2015 11:05:23 AM
>> Subject: Re: [Dmtcp-forum] checkpointing MPI applications
>>
>> Hi Rohan,
>>     Sorry to burden you with this, but with Jiajun and Artem on trips,
>> you're our main expert right now.  Could you answer this one?
>>     Manuel is running MPICH.  He's using DMTCP 2.4.0-rc4, which should
>> be reasonably up to date.  Three processes on one node.  Checkpoint
>> succeeds.  But on restart, he gets the following error:
>>
>> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in
>> > validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
>> > CHECKPOINTED state.  Reject incoming computation process requesting
>> > restart.'
>> >      compId = 6db90f3d5a9dd200-40000-55536cf2
>> >      hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
>> >      minimumState() = WorkerState::RUNNING
>>
>> Rohan,
>>     Even if it turns out that there's no bug and that Manuel was not
>> using the commands correctly, I'd like to consider this a "documentation
>> bug".
>> Our documentation for DMTCP/MPI has changed in recent months, and it's
>> still not properly polished.
>>     Could you also keep in mind where our documentation for running
>> DMTCP with MPI is lacking, and then put up a pull request to improve
>> our documentation?
>>
>> Thanks,
>> - Gene
>>
>>
>> On Thu, May 14, 2015 at 01:02:42PM +0200, Manuel Rodríguez Pascual wrote:
>> > Hi all,
>> >
>> > I am a newbie in DMTCP. I am trying to checkpoint my MPI application but
>> > still haven't been able. I am pretty sure that I'm doing something pretty
>> > obvious wrong, but cannot find the problem on my own.
>> >
>> > Some info,
>> >
>> > - I am employing MPICH 3.1.4, and my DMTCP version is  2.4.0-rc4
>> > - MPICH is working OK. DMTCP is working OK for serial applications too.
>> > - My code is a simple hello world.
>> > - At this moment I am going with 3 mpi tasks running on the same node, to
>> > simplify things
>> > - coordinator is started with "dmtcp_coordinator",  and my application
>> with
>> > "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI". Both are run with
>> > user "slurm".
>> >
>> > - passwordless SSH is enabled. Tested with
>> >
>> > ssh slurm-master  which dmtcp_launch
>> > /usr/local/bin/dmtcp_launch
>> >
>> >
>> > I don't know where the error can be, so I have included quite a lot of
>> > information below to help.
>> >
>> > Thanks for your help,
>> >
>> >
>> > Manuel
>> > ---
>> > ---
>> > COORDINATOR OUTPUT
>> > [root@slurm-master tmp]# dmtcp_coordinator
>> > dmtcp_coordinator starting...
>> >     Host: slurm-master (192.168.1.10)
>> >     Port: 7779
>> >     Checkpoint Interval: disabled (checkpoint manually instead)
>> >     Exit on last client: 0
>> > Type '?' for help.
>> >
>> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> > connected'
>> >      hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
>> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
>> > process Information after exec()'
>> >      progname = mpiexec.hydra
>> >      msg.from = 6db90f3d5a9dd200-40000-55536cf3
>> >      client->identity() = 6db90f3d5a9dd200-20448-55536cf2
>> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> > connected'
>> >      hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
>> > process Information after fork()'
>> >      client->hostname() = slurm-master
>> >      client->progname() = mpiexec.hydra_(forked)
>> >      msg.from = 6db90f3d5a9dd200-41000-55536cf3
>> >      client->identity() = 6db90f3d5a9dd200-40000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
>> > process Information after exec()'
>> >      progname = hydra_pmi_proxy
>> >      msg.from = 6db90f3d5a9dd200-41000-55536cf3
>> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> > connected'
>> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> > connected'
>> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
>> > process Information after fork()'
>> >      client->hostname() = slurm-master
>> >      client->progname() = hydra_pmi_proxy_(forked)
>> >      msg.from = 6db90f3d5a9dd200-42000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> > connected'
>> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
>> > process Information after fork()'
>> >      client->hostname() = slurm-master
>> >      client->progname() = hydra_pmi_proxy_(forked)
>> >      msg.from = 6db90f3d5a9dd200-43000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
>> > process Information after fork()'
>> >      client->hostname() = slurm-master
>> >      client->progname() = hydra_pmi_proxy_(forked)
>> >      msg.from = 6db90f3d5a9dd200-44000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
>> > process Information after exec()'
>> >      progname = helloWorldMPI
>> >      msg.from = 6db90f3d5a9dd200-42000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-42000-55536cf4
>> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
>> > process Information after exec()'
>> >      progname = helloWorldMPI
>> >      msg.from = 6db90f3d5a9dd200-44000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-44000-55536cf4
>> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
>> > process Information after exec()'
>> >      progname = helloWorldMPI
>> >      msg.from = 6db90f3d5a9dd200-43000-55536cf4
>> >      client->identity() = 6db90f3d5a9dd200-43000-55536cf4
>> >
>> > ---
>> > ---
>> >
>> > I press letter C to checkpoint
>> >
>> > ---
>> > ---
>> >
>> > [20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint;
>> > REASON='starting checkpoint, suspending all nodes'
>> >      s.numPeers = 5
>> > [20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint;
>> > REASON='Incremented Generation'
>> >      compId.generation() = 1
>> > [20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState;
>> > REASON='locking all nodes'
>> > [20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState;
>> > REASON='draining all nodes'
>> > [20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState;
>> > REASON='checkpointing all nodes'
>> > [20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState;
>> > REASON='building name service database'
>> > [20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState;
>> > REASON='entertaining queries now'
>> > [20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState;
>> > REASON='refilling all nodes'
>> > [20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState;
>> > REASON='restarting all nodes'
>> > ---
>> > ---
>> >
>> >
>> > I cancel the running application with CTRL+C
>> >
>> > ---
>> > ---
>> > COORDINATOR
>> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
>> > disconnected'
>> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
>> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
>> > disconnected'
>> >      client->identity() = 6db90f3d5a9dd200-40000-55536cf3
>> >
>> > APPLICATION
>> >
>> > [mpiexec@slurm-master] Sending Ctrl-C to processes as requested
>> > [mpiexec@slurm-master] Press Ctrl-C again to force abort
>> >
>> > Ctrl-C caught... cleaning up processes
>> > [proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb
>> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert
>> (!closed)
>> > failed
>> > [proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event
>> > (/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback
>> > returned error status
>> > [proxy:0:0@slurm-master] main
>> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine
>> error
>> > waiting for event
>> > ^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion
>> > (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76):
>> one
>> > of the processes terminated badly; aborting
>> > [mpiexec@slurm-master] HYDT_bsci_wait_for_completion
>> > (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23):
>> > launcher returned error waiting for completion
>> > [mpiexec@slurm-master] HYD_pmci_wait_for_completion
>> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
>> > returned error waiting for completion
>> > [mpiexec@slurm-master] main
>> > (/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager
>> > error waiting for completion
>> >
>> > ---
>> > ---
>> >
>> > Everything until now seems to be OK. I get the same output with serial
>> > applications.
>> >
>> > Then I try to restart,
>> >
>> > ---
>> > ---
>> >
>> > APPLICATION
>> > [slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
>> > [21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake;
>> > REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
>> > dmtcp_restart (21164): Terminating...
>> >
>> > COORDINATOR
>> >
>> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in
>> > validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
>> > CHECKPOINTED state.  Reject incoming computation process requesting
>> > restart.'
>> >      compId = 6db90f3d5a9dd200-40000-55536cf2
>> >      hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
>> >      minimumState() = WorkerState::RUNNING
>> > ---
>> > ---
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Dr. Manuel Rodríguez-Pascual
>> > skype: manuel.rodriguez.pascual
>> > phone: (+34) 913466173 // (+34) 679925108
>> >
>> > CIEMAT-Moncloa
>> > Edificio 22, desp. 1.25
>> > Avenida Complutense, 40
>> > 28040- MADRID
>> > SPAIN
>>
>
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN



-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
