Hello, world!
I'm trying to set up DMTCP to work with Slurm in my environment for my
users.
I'm using DMTCP dmtcp-2.6.1~rc1-0.1.el7.x86_64, which I installed from
the EPEL repo. I'm using Slurm 20.11.3 and OpenMPI 4.0.3. My program is
a simple MPI-based "Hello, world" style program that prints out messages
from each of the MPI ranks every 2 seconds.
I created an sbatch file using the example found at
https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job.
Basically, I just removed the comments to make it more readable and
brief enough to post here:
#!/bin/bash
#SBATCH -p general
#SBATCH -t 00:01:00
#SBATCH -n 2
#SBATCH --mem=1000
#SBATCH --export=ALL
#SBATCH -J dmctp_hello
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --mail-type=ALL
module load gcc/9.3.0
module load openmpi/4.0.3
# 0. Set up DMTCP environment for a job
start_coordinator()
{
############################################################
# For debugging when launching a custom coordinator, uncomment
# the following lines and provide the proper host and port for
# the coordinator.
############################################################
# export DMTCP_COORD_HOST=$h
# export DMTCP_COORD_PORT=$p
# return
fname=dmtcp_command.$SLURM_JOBID
h=$(hostname)
check_coordinator=$(which dmtcp_coordinator)
if [ -z "$check_coordinator" ]; then
echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
exit 0
fi
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
while true; do
if [ -f "$fname" ]; then
p=$(cat $fname)
if [ -n "$p" ]; then
# try to communicate ? dmtcp_command -p $p l
break
fi
fi
done
# Create dmtcp_command wrapper for easy communication with coordinator
p=$(cat $fname)
chmod +x $fname
echo "#!/bin/bash" > $fname
echo >> $fname
echo "export PATH=$PATH" >> $fname
echo "export DMTCP_COORD_HOST=$h" >> $fname
echo "export DMTCP_COORD_PORT=$p" >> $fname
echo "dmtcp_command \$@" >> $fname
# Set up local environment for DMTCP
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p
export DMTCP_SIGCKPT=10
}
# 1. Start DMTCP coordinator
start_coordinator -i 10
# 2. Launch application
srun /usr/bin/dmtcp_launch --rm ./dmctp_hello
# dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello
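The polling loop in start_coordinator spins without sleeping and never gives up if the coordinator fails to write its port file. A minimal sketch of a bounded wait follows. The helper name wait_for_port_file and the 30-second default timeout are assumptions for illustration; neither comes from DMTCP or the upstream example.

```shell
# Hypothetical helper (not part of DMTCP or the upstream job example):
# wait until a port file exists and is non-empty, then print its contents.
# Gives up and returns 1 once the timeout (in seconds) is reached.
wait_for_port_file() {
    local file=$1
    local timeout=${2:-30}   # assumed default, adjust as needed
    local waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for $file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$file"
}
```

Inside start_coordinator this could stand in for the while loop, for example as: p=$(wait_for_port_file "$fname" 30) || exit 1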
When I submit the job using srun as shown above, the only output for my
job is this message, written to standard error:
srun: error: ellis002: tasks 0-1: Exited with exit code 99
where ellis002 is the node that was allocated to the job.
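Since the only output is the exit code, one thing that can be ruled out cheaply is the launch target itself. A minimal sketch of a pre-flight check to run in the batch script before the srun line, assuming a hypothetical helper named require_executable (it is not part of Slurm or DMTCP):

```shell
# Hypothetical helper: fail early with a clear message if the program
# to be launched is missing, is a directory, or is not executable.
require_executable() {
    local prog=$1
    if [ ! -x "$prog" ] || [ -d "$prog" ]; then
        echo "Not an executable file: $prog" >&2
        return 1
    fi
}
```

In the script it would be called with the same path that is passed to dmtcp_launch, for example: require_executable ./dmctp_hello || exit 1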
When I replace the srun command with this one:
dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello
I get a lot more errors:
[40000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[40000] WARNING at signalwrappers.cpp:141 in sigaction;
REASON='JWARNING(false) failed'
"Application trying to use DMTCP's signal for it's own use.\n" "
You should employ a different signal by setting the\n" " environment
variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP
should use for checkpointing." = Application trying to use DMTCP's
signal for it's own use.
You should employ a different signal by setting the
environment variable DMTCP_SIGCKPT to the number
of the signal that DMTCP should use for checkpointing.
stopSignal = 10
[40000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] ERROR at sysvipcwrappers.cpp:169 in shmctl;
REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (41000): Terminating...
[42000] ERROR at sysvipcwrappers.cpp:169 in shmctl;
REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (42000): Terminating...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[57709,1],1]
Exit code: 99
--------------------------------------------------------------------------
[40000] WARNING at signalwrappers.cpp:141 in sigaction;
REASON='JWARNING(false) failed'
"Application trying to use DMTCP's signal for it's own use.\n" "
You should employ a different signal by setting the\n" " environment
variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP
should use for checkpointing." = Application trying to use DMTCP's
signal for it's own use.
You should employ a different signal by setting the
environment variable DMTCP_SIGCKPT to the number
of the signal that DMTCP should use for checkpointing.
stopSignal = 10
I searched for "Application trying to use DMTCP's signal for it's
own use" and found this page:
http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_(DMTCP)/
I added
export DMTCP_SIGCKPT=10
to my sbatch script as they advised (and as shown above), but it didn't
help.
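The warning reports stopSignal = 10, which is the same number the script exports as DMTCP_SIGCKPT. Signal numbers differ between platforms, so it can help to check which signal a number names on the compute node. A minimal sketch, assuming a hypothetical wrapper named sig_name around the standard kill -l lookup:

```shell
# Hypothetical wrapper: print the name (without the SIG prefix) of a
# signal number, using the shell's standard "kill -l <number>" lookup.
sig_name() {
    kill -l "$1"
}

# Example: show what the value exported as DMTCP_SIGCKPT maps to here.
echo "signal 10 is SIG$(sig_name 10) on this host"
```

On x86_64 Linux, signal 10 is SIGUSR1 and signal 12 is SIGUSR2. Whether the application or the MPI launcher installs handlers for either of these is an assumption that would need to be checked separately.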
Any suggestions?
--
Prentice