I have installed Slurm version 15.08.10 and DMTCP version 2.4.4. When
I run the job without Slurm:
export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm
dmtcp_launch --rm mpirun.openmpi --host nodeslurm1,nodeslurm2 -np 2 \
    /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj
dmtcp_command -s
Coordinator:
Host: localhost
Port: 7779 (default port)
Status...
NUM_PEERS=5
RUNNING=yes
CKPT_INTERVAL=0 (checkpoint manually)
Checkpointing and resuming work every time.
However, when I run the same commands from a script under Slurm, the
coordinator gets stuck and only one node stops executing:
[2983] NOTE at dmtcp_coordinator.cpp:667 in updateMinimumState;
REASON='locking all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:673 in updateMinimumState;
REASON='draining all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:679 in updateMinimumState;
REASON='checkpointing all nodes'
dmtcp_command -s
Coordinator:
Host: localhost
Port: 7779 (default port)
Status...
NUM_PEERS=2
RUNNING=yes
CKPT_INTERVAL=0 (checkpoint manually)
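The visible difference between the two status outputs is the peer count: NUM_PEERS=5 without Slurm, NUM_PEERS=2 under Slurm. Below is a minimal sketch for pulling that value out of the status text so the two runs can be compared. It assumes the output format shown above; the function name num_peers and the variable status_under_slurm are only illustrative and are not part of DMTCP.

```shell
# Hypothetical helper (not part of DMTCP): read `dmtcp_command -s`
# output on stdin and print the number after NUM_PEERS=.
# Assumes the line starts with NUM_PEERS=, as in the output above.
num_peers() {
    sed -n 's/^NUM_PEERS=\([0-9][0-9]*\).*$/\1/p'
}

# Sample status text copied from the Slurm run above.
status_under_slurm='Coordinator:
Host: localhost
Port: 7779 (default port)
Status...
NUM_PEERS=2
RUNNING=yes
CKPT_INTERVAL=0 (checkpoint manually)'

# Print the peer count found in the sample text.
printf '%s\n' "$status_under_slurm" | num_peers
```

Against a live coordinator the same helper would be used as `dmtcp_command -s | num_peers`.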
The Slurm script uses the same commands as in the first test. Does
Slurm need any different settings? I have already tried the scripts in
the rm folder.
Best regards.
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum